<a href="https://colab.research.google.com/github/arun41687/naive_rag_hf_no_token_hybrid/blob/main/colab_rag_hf_no_token_hybrid.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Set up a RAG pipeline in Colab using the "https://github.com/arun41687/naive_rag_hf_no_token_hybrid" repository, ingest two provided PDF documents, configure it with the Hugging Face mistralai/Mistral-7B-Instruct-v0.2 LLM, and test its ability to retrieve correct answers from the PDFs.

## Clone Git Repository

### Subtask:
Clone the provided Git repository (https://github.com/arun41687/naive_rag_hf_no_token_hybrid.git) into the Colab environment to access the codebase.


In [1]:
# Clone the repo (skipped if /content/naive_rag already exists)
import os
import subprocess

repo_dir = "/content/naive_rag"
if not os.path.exists(repo_dir):
    subprocess.run(["git", "clone", "https://github.com/arun41687/naive_rag_hf_no_token_hybrid.git", repo_dir], check=True)
os.chdir(repo_dir)
print("Repo ready:", os.getcwd())

Repo ready: /content/naive_rag


## Install Dependencies and Setup Environment

### Subtask:
From inside the cloned repository, install the Python dependencies the pipeline needs: `transformers`, `accelerate`, and `bitsandbytes` for the LLM, plus `sentence-transformers`, `faiss-cpu`, and `pdfplumber` for retrieval and PDF parsing.


In [2]:
# Install dependencies
!pip install -q transformers accelerate bitsandbytes
!pip install -q sentence-transformers faiss-cpu pdfplumber



## Set Up Input Data to Parse

Upload the two PDF files to the Colab session so the pipeline can parse them.

**Required Files**:
- Apple 10-K: `10-Q4-2024-As-Filed.pdf`
- Tesla 10-K: `tsla-20231231-gen.pdf`

**Instructions**: Run the cell below and select both files in the upload dialog; they are saved to the current working directory.

In [4]:
# Upload PDFs if they are not already in the repo (Colab only)
import os
try:
    from google.colab import files
    uploaded = files.upload()
    for name in uploaded.keys():
        print("Uploaded:", name)
except Exception as exc:
    print("Upload skipped (not running in Colab):", exc)
print("Current dir:", os.getcwd())
print("PDFs:", [f for f in os.listdir('.') if f.lower().endswith('.pdf')])

Saving 10-Q4-2024-As-Filed.pdf to 10-Q4-2024-As-Filed.pdf
Saving tsla-20231231-gen.pdf to tsla-20231231-gen.pdf
Uploaded: 10-Q4-2024-As-Filed.pdf
Uploaded: tsla-20231231-gen.pdf
Current dir: /content/naive_rag
PDFs: ['tsla-20231231-gen.pdf', '10-Q4-2024-As-Filed.pdf']
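Before moving on, it can help to confirm that both required files actually landed in the working directory. The following is a minimal sketch; the helper name `missing_pdfs` is hypothetical and not part of the repository.

```python
from pathlib import Path

# File names taken from the list of required PDFs in this notebook.
REQUIRED_PDFS = ["10-Q4-2024-As-Filed.pdf", "tsla-20231231-gen.pdf"]


def missing_pdfs(required, directory="."):
    """Return the required file names that are not present in `directory`."""
    present = {p.name for p in Path(directory).iterdir() if p.is_file()}
    return [name for name in required if name not in present]


missing = missing_pdfs(REQUIRED_PDFS)
if missing:
    print("Missing PDFs:", missing)
else:
    print("All required PDFs found.")
```

If the list of missing files is not empty, re-run the upload cell before initializing the RAG system.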


## Initialize RAG System

Initialize the RAG system with the Hugging Face model `mistralai/Mistral-7B-Instruct-v0.2`.

**Note:** The first run downloads:
- Sentence transformer model (~90MB)
- Cross-encoder model (~130MB)
- `mistralai/Mistral-7B-Instruct-v0.2` model (~14GB without quantization)

This may take 5-10 minutes depending on your internet connection.
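The "~14GB without quantization" figure follows from simple arithmetic: each parameter takes 2 bytes at 16-bit precision. The sketch below estimates weight memory only, assuming a round 7 billion parameters; activations, the KV cache, and quantization overhead are ignored, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a round 7 billion parameters for a "7B" model.
PARAMS_7B = 7e9

print(f"16-bit weights: ~{weight_memory_gb(PARAMS_7B, 16):.1f} GB")
print(f" 4-bit weights: ~{weight_memory_gb(PARAMS_7B, 4):.1f} GB")
```

Under these assumptions, 4-bit loading needs roughly a quarter of the 16-bit memory, which is why a quantized 7B model can fit on a GPU that could not hold the full-precision weights.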

## Ingest PDF Documents

Parse the PDFs, split them into chunks, generate embeddings, build the FAISS index, and save it to disk.

In [6]:
from rag_system import RAGSystem

documents = [
    {"path": "10-Q4-2024-As-Filed.pdf", "name": "Apple 10-K"},
    {"path": "tsla-20231231-gen.pdf", "name": "Tesla 10-K"},
]

rag = RAGSystem(
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    embedding_model="all-MiniLM-L6-v2",
    use_reranker=True,
    use_advanced_reranker=True
)

Downloading required NLTK data...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



Using advanced stopword+keyword reranker



BertForSequenceClassification LOAD REPORT from: cross-encoder/ms-marco-MiniLM-L-12-v2
Key                          | Status     |  | 
-----------------------------+------------+--+-
bert.embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



In [7]:
rag.ingest_documents(documents)
rag.save_index("./rag_index")

Starting document ingestion...
Processing Apple 10-K from 10-Q4-2024-As-Filed.pdf...
  Created 745 chunks
Processing Tesla 10-K from tsla-20231231-gen.pdf...
  Created 794 chunks
Total chunks created: 1539
Creating embeddings and indexing...
Indexing complete!
Index saved to ./rag_index
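The log reports how many chunks were created per document, but the chunking strategy itself lives inside the repository and is not shown here. As a rough illustration only, a common approach is a fixed-size word window with overlap. The function name `chunk_words` and its default sizes are assumptions, not the repository's actual settings.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word windows of `chunk_size` that overlap by `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=2):
    print(chunk)
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of a larger index.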


## Test with Sample Query

Test the system with a single question before running the full evaluation.

In [8]:
question = "What was Apples total revenue for the fiscal year ended September 28, 2024?"
result = rag.answer_question(question)

print(f"Question: {question}")
print(f"Answer: {result['answer'][:100]}...")
print(f"Sources: {result['sources']}\n")


🔄 Loading LLM (first query)...
🌐 Loading model from HuggingFace: mistralai/Mistral-7B-Instruct-v0.2
   (Will use cached version if available)
ℹ️  No HuggingFace token found (optional - only needed for gated models like Phi-3)



📝 Loading model config...
⚡ Loading model in 4-bit quantized mode



✅ LLM loaded successfully!



Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Question: What was Apples total revenue for the fiscal year ended September 28, 2024?
Answer: The total revenue for Apple for the fiscal year ended September 28, 2024 was $391,035 million. ([App...
Sources: ['Apple 10-K, p. 33', 'Apple 10-K, p. 41', 'Apple 10-K, p. 50', 'Apple 10-K, p. 38', 'Apple 10-K, p. 31', 'Apple 10-K, p. 46', 'Apple 10-K, p. 118', 'Apple 10-K, p. 30', 'Apple 10-K, p. 35']



In [9]:
question = "What is the total amount of term debt (current + non-current) reported by Apple as of September 28, 2024?"
result = rag.answer_question(question)

print(f"Question: {question}")
print(f"Answer: {result['answer'][:100]}...")
print(f"Sources: {result['sources']}\n")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Question: What is the total amount of term debt (current + non-current) reported by Apple as of September 28, 2024?
Answer: The total amount of term debt reported by Apple as of September 28, 2024 is ($13,505) million for cu...
Sources: ['Apple 10-K, p. 33', 'Apple 10-K, p. 41', 'Apple 10-K, p. 50', 'Apple 10-K, p. 38', 'Apple 10-K, p. 48', 'Apple 10-K, p. 46', 'Apple 10-K, p. 40']
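The answers embed citations such as `[Apple 10-K - Page 38]`, while `sources` lists the retrieved pages as `'Apple 10-K, p. 38'`. A quick sanity check is to confirm that every page the model cites was actually retrieved. The sketch below assumes those two string formats hold for all answers; the helper names are hypothetical, and the sample values are copied from the recorded output for the revenue question.

```python
import re

# Patterns inferred from the recorded output; treat them as assumptions.
CITATION_RE = re.compile(r"\[([^\[\]]+?) - Page (\d+)\]")
SOURCE_RE = re.compile(r"^(.+), p\. (\d+)$")


def cited_pages(answer: str) -> set[tuple[str, int]]:
    """Extract (document, page) pairs cited inside the answer text."""
    return {(doc.strip(), int(page)) for doc, page in CITATION_RE.findall(answer)}


def retrieved_pages(sources: list[str]) -> set[tuple[str, int]]:
    """Parse (document, page) pairs from the retrieved source labels."""
    pages = set()
    for source in sources:
        match = SOURCE_RE.match(source)
        if match:
            pages.add((match.group(1).strip(), int(match.group(2))))
    return pages


def unsupported_citations(answer: str, sources: list[str]) -> set[tuple[str, int]]:
    """Return cited pages that do not appear among the retrieved sources."""
    return cited_pages(answer) - retrieved_pages(sources)


answer = ("The total revenue for Apple for the fiscal year ended September 28, 2024 "
          "was $391,035 million. ([Apple 10-K - Page 38])")
sources = ["Apple 10-K, p. 33", "Apple 10-K, p. 41", "Apple 10-K, p. 50",
           "Apple 10-K, p. 38", "Apple 10-K, p. 31"]
print(unsupported_citations(answer, sources))  # prints set()
```

In the notebook, the same check can be run as `unsupported_citations(result["answer"], result["sources"])`. An empty set only shows that the cited page was retrieved; it does not verify that the figure in the answer is correct.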



## Run Full Evaluation

Run evaluation on all 13 test questions.

In [10]:
# Optional: run full evaluation
from rag_system import run_evaluation
run_evaluation(rag)


RUNNING EVALUATION ON 13 TEST QUESTIONS

Q1: What was Apples total revenue for the fiscal year ended September 28, 2024?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: The total revenue for Apple for the fiscal year ended September 28, 2024 was $391,035 million. ([App...
Sources: ['Apple 10-K, p. 33', 'Apple 10-K, p. 41', 'Apple 10-K, p. 50', 'Apple 10-K, p. 38', 'Apple 10-K, p. 31', 'Apple 10-K, p. 46', 'Apple 10-K, p. 118', 'Apple 10-K, p. 30', 'Apple 10-K, p. 35']

Q2: How many shares of common stock were issued and outstanding as of October 18, 2024?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: The number of shares of common stock issued and outstanding as of October 18, 2024 was 15,115,823,00...
Sources: ['Apple 10-K, p. 21', 'Apple 10-K, p. 64', 'Tesla 10-K, p. 2', 'Apple 10-K, p. 34', 'Apple 10-K, p. 47', 'Apple 10-K, p. 2', 'Tesla 10-K, p. 96', 'Apple 10-K, p. 35', 'Tesla 10-K, p. 50']

Q3: What is the total amount of term debt (current + non-current) reported by Apple as of September 28, 2024?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: The total amount of term debt reported by Apple as of September 28, 2024 is ($13,505) million for cu...
Sources: ['Apple 10-K, p. 33', 'Apple 10-K, p. 41', 'Apple 10-K, p. 50', 'Apple 10-K, p. 38', 'Apple 10-K, p. 48', 'Apple 10-K, p. 46', 'Apple 10-K, p. 40']

Q4: On what date was Apples 10-K report for 2024 signed and filed with the SEC?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: The Apple 10-K report for 2024 was signed and filed with the SEC on November 1, 2024. This is stated...
Sources: ['Apple 10-K, p. 59', 'Apple 10-K, p. 115', 'Apple 10-K, p. 57', 'Apple 10-K, p. 116', 'Apple 10-K, p. 1', 'Apple 10-K, p. 118', 'Apple 10-K, p. 60', 'Apple 10-K, p. 56', 'Apple 10-K, p. 2']

Q5: Does Apple have any unresolved staff comments from the SEC as of this filing? How do you know?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: According to the provided context, there are no unresolved staff comments from the SEC mentioned in ...
Sources: ['Apple 10-K, p. 59', 'Apple 10-K, p. 8', 'Apple 10-K, p. 110', 'Apple 10-K, p. 1', 'Apple 10-K, p. 118', 'Apple 10-K, p. 112', 'Apple 10-K, p. 2', 'Apple 10-K, p. 20']

Q6: What was Teslas total revenue for the year ended December 31, 2023?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: Tesla's total revenue for the year ended December 31, 2023 was $96,773 million [Tesla 10-K - Page 51...
Sources: ['Tesla 10-K, p. 41', 'Tesla 10-K, p. 94', 'Tesla 10-K, p. 51', 'Tesla 10-K, p. 89', 'Tesla 10-K, p. 56', 'Tesla 10-K, p. 60', 'Tesla 10-K, p. 42', 'Tesla 10-K, p. 53', 'Tesla 10-K, p. 50']

Q7: What percentage of Teslas total revenue in 2023 came from Automotive Sales (excluding Leasing)?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: To find the percentage of Tesla's total revenue in 2023 that came from Automotive Sales (excluding L...
Sources: ['Tesla 10-K, p. 41', 'Tesla 10-K, p. 51', 'Tesla 10-K, p. 56', 'Tesla 10-K, p. 39', 'Tesla 10-K, p. 40', 'Tesla 10-K, p. 50']

Q8: What is the primary reason Tesla states for being highly dependent on Elon Musk?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: Tesla states that it is highly dependent on Elon Musk, its Chief Executive Officer, because he spend...
Sources: ['Tesla 10-K, p. 91', 'Tesla 10-K, p. 123', 'Tesla 10-K, p. 22', 'Tesla 10-K, p. 89', 'Tesla 10-K, p. 21', 'Tesla 10-K, p. 29']

Q9: What types of vehicles does Tesla currently produce and deliver?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: Tesla currently produces and delivers Model S, Model X, Model 3, and Model Y vehicles [Tesla 10-K - ...
Sources: ['Tesla 10-K, p. 54', 'Tesla 10-K, p. 34', 'Tesla 10-K, p. 8', 'Tesla 10-K, p. 6', 'Tesla 10-K, p. 20', 'Tesla 10-K, p. 5', 'Tesla 10-K, p. 35', 'Tesla 10-K, p. 17']

Q10: What is the purpose of Teslas 'lease pass-through fund arrangements'?


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Answer: Tesla's lease pass-through fund arrangements are financial structures where their wholly-owned subsi...
Sources: ['Tesla 10-K, p. 9', 'Tesla 10-K, p. 54', 'Tesla 10-K, p. 13', 'Tesla 10-K, p. 20', 'Tesla 10-K, p. 56', 'Tesla 10-K, p. 59', 'Tesla 10-K, p. 82']

Q11: What is Teslas stock price forecast for 2025?
Answer: This question cannot be answered based on the provided documents....
Sources: []

Q12: Who is the CFO of Apple as of 2025?
Answer: This question cannot be answered based on the provided documents....
Sources: []

Q13: What color is Teslas headquarters painted?
Answer: This question cannot be answered based on the provided documents....
Sources: []


Evaluation complete! Results saved to evaluation_results_20260216_010250.json
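The returned records can be summarized to see at a glance which questions the system declined to answer. This is a minimal sketch: it assumes each record has the `question_id`, `answer`, and `sources` keys visible in the output, and that the saved JSON file holds the same list, which is not confirmed here. The names `summarize` and `load_results` are hypothetical.

```python
import json

# Refusal text as it appears in the evaluation log.
REFUSAL = "This question cannot be answered based on the provided documents."


def summarize(results):
    """Count records and list the question IDs that were refused or had no sources."""
    refused = [r["question_id"] for r in results if r["answer"].strip().startswith(REFUSAL)]
    no_sources = [r["question_id"] for r in results if not r["sources"]]
    return {"total": len(results), "refused": refused, "no_sources": no_sources}


def load_results(path):
    """Load results from a JSON file (assumes the file stores the list of records)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Two records shaped like the evaluation output, for illustration.
sample = [
    {"question_id": 1,
     "answer": "The total revenue for Apple for the fiscal year ended September 28, 2024 "
               "was $391,035 million. ([Apple 10-K - Page 38])",
     "sources": ["Apple 10-K, p. 33", "Apple 10-K, p. 38"]},
    {"question_id": 11, "answer": REFUSAL, "sources": []},
]
print(summarize(sample))
```

Calling `summarize(run_evaluation(rag))` in the notebook should work the same way, provided `run_evaluation` returns the list it displays.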


[{'question_id': 1,
  'answer': 'The total revenue for Apple for the fiscal year ended September 28, 2024 was $391,035 million. ([Apple 10-K - Page 38])',
  'sources': ['Apple 10-K, p. 33',
   'Apple 10-K, p. 41',
   'Apple 10-K, p. 50',
   'Apple 10-K, p. 38',
   'Apple 10-K, p. 31',
   'Apple 10-K, p. 46',
   'Apple 10-K, p. 118',
   'Apple 10-K, p. 30',
   'Apple 10-K, p. 35']},
 {'question_id': 2,
  'answer': 'The number of shares of common stock issued and outstanding as of October 18, 2024 was 15,115,823,000, as stated in the Apple 10-K document on page 2. [Apple 10-K - Page 2]',
  'sources': ['Apple 10-K, p. 21',
   'Apple 10-K, p. 64',
   'Tesla 10-K, p. 2',
   'Apple 10-K, p. 34',
   'Apple 10-K, p. 47',
   'Apple 10-K, p. 2',
   'Tesla 10-K, p. 96',
   'Apple 10-K, p. 35',
   'Tesla 10-K, p. 50']},
 {'question_id': 3,
  'answer': "The total amount of term debt reported by Apple as of September 28, 2024 is ($13,505) million for current term debt and $91,493 million for non-cur