This project demonstrates a simple Retrieval-Augmented Generation (RAG) pipeline using Python. It retrieves information from a PDF document, embeds the text into vectors, stores them in a BigQuery Vector Store, and uses a Generative AI model to answer questions based on the document's content.
The script performs the following steps:
- Loads a PDF document from a URL.
- Splits the document's text into smaller chunks.
- Generates embeddings for each chunk using a Hugging Face sentence transformer model.
- Stores the text chunks and their corresponding embeddings in a BigQuery Vector Store.
- Takes a user query and retrieves the most relevant text chunks from BigQuery.
- Uses a Gemini Large Language Model (LLM) to generate a response to the user's query based on the retrieved context.
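The chunking step above can be sketched in a few lines of plain Python. This is a hypothetical stand-in for a library splitter (the project may use something like LangChain's `RecursiveCharacterTextSplitter`), shown only to illustrate the idea of overlapping chunks:

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    A simplified stand-in for a real text splitter: each chunk is at most
    chunk_size characters, and consecutive chunks share `overlap` characters
    so that sentences cut at a boundary still appear whole in one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# 1500 characters of toy text -> 9 chunks of at most 200 characters each.
chunks = split_into_chunks("word " * 300, chunk_size=200, overlap=20)
print(len(chunks))      # → 9
print(len(chunks[0]))   # → 200
```

Overlap matters in RAG pipelines because a fact split across two chunks would otherwise be unretrievable from either.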
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd <repository-directory>
  ```
- Create a virtual environment:

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate
  ```
- Install the dependencies:

  ```bash
  pip install -r req.txt
  ```
- Set up your environment variables: create a `.env` file in the root directory of the project and add the following variables:

  ```
  GEMINI_API_KEYS="YOUR_GEMINI_API_KEY"
  PROJECT_ID="YOUR_GOOGLE_CLOUD_PROJECT_ID"
  DATASET="YOUR_BIGQUERY_DATASET_NAME"
  TABLE="YOUR_BIGQUERY_TABLE_NAME"
  REGION="YOUR_BIGQUERY_DATASET_REGION"
  ```

  Replace the placeholder values with your actual credentials and configuration.
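A minimal sketch of how a script might read these variables at startup. Many projects use the third-party `python-dotenv` package for this; the loader below is a hypothetical stdlib-only stand-in, not the project's actual code:

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader (a stand-in for python-dotenv's load_dotenv).

    Parses KEY="VALUE" lines, skipping blanks and comments, and puts the
    results into os.environ so the rest of the script can read them.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip().strip('"')

# Example: write a sample file, load it, and read one variable back.
with open(".env.example", "w") as f:
    f.write('PROJECT_ID="my-gcp-project"\nDATASET="rag_dataset"\n')
load_env(".env.example")
print(os.environ["PROJECT_ID"])  # → my-gcp-project
```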
To run the script, execute the main.py file:

```bash
python main.py
```

The script will then:
- Fetch the PDF and process it. The default PDF is an Nvidia earnings call transcript, but you can change the URL in main.py: https://s201.q4cdn.com/141608511/files/doc_financials/2026/q3/NVDA-Q3-2026-Earnings-Call-19-November-2025-5_00-PM-ET.pdf
- Store the data in BigQuery.
- Run a sample query: "What is the main topic of the document?".
- Print the generated answer to the console.
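The retrieval step behind the sample query is a nearest-neighbor search over the stored embeddings. The toy in-memory version below illustrates the idea with made-up 3-dimensional vectors; it is an assumption-laden sketch, not the BigQuery Vector Store's actual search, which operates on real embedding vectors at scale:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose embeddings are most similar to the query,
    mimicking what a vector store's similarity search does at scale."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]

# Toy 3-dimensional "embeddings" for illustration only.
chunks = ["revenue grew 94%", "data center segment", "weather was sunny"]
vecs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]]
query = [1.0, 0.0, 0.0]
print(top_k(query, vecs, chunks, k=2))  # → ['revenue grew 94%', 'data center segment']
```

The retrieved chunks are then passed to the LLM as context, so answer quality depends directly on this ranking.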
Snapshots for reference

