In [6]:
!pip install langchain_community pypdf
!pip install sentence_transformers


**Data loading**

Convert raw data (PDF, text, etc.) into a structured format

In [7]:
import os
from langchain_community.document_loaders import PyPDFLoader

In [8]:
DOCS_PATH = '/content/drive/MyDrive/RAG/docs'

In [9]:
def load_pdfs(folder_path):
  documents = []

  for file in os.listdir(folder_path):
    if file.endswith('.pdf'):
      loader = PyPDFLoader(os.path.join(folder_path, file))
      docs = loader.load()

      for d in docs:
        d.metadata['source'] = file

      documents.extend(docs)
  return documents

In [10]:
documents = load_pdfs(DOCS_PATH)
print(f"Loaded {len(documents)} pages")

Loaded 48 pages


**Docs Cleaning**

Remove noise + normalize text for better retrieval

In [11]:
import re
def clean_documents(text):
  text = re.sub(r'\n+',' ',text)
  text = re.sub(r'\s+',' ',text)
  return text.strip()
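
As a quick check of what this cleaning step does, the sketch below applies the same two substitutions to a made-up string (the sample text is an assumption, not taken from the PDFs). Because `\s` already matches newlines, the second substitution alone would give the same result.

```python
import re

def clean_documents(text):
    # Same logic as the cell above: collapse newlines, then any whitespace run.
    text = re.sub(r'\n+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Hypothetical sample resembling raw PDF text with hard line breaks.
sample = "Deep residual\nlearning   for\n\nimage recognition  "
cleaned = clean_documents(sample)
print(cleaned)
```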

**PII (Personally Identifiable Information) Masking**

Protect sensitive info. This example masks only 10-digit phone numbers; names and other identifiers would need additional rules.

In [12]:
def mask_pii(text):
  text = re.sub(r'\b\d{10}\b','[PHONE]',text)
  return text
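
The pattern `\b\d{10}\b` matches only ten consecutive digits, so phone numbers written with separators and email addresses pass through unchanged. A possible extension is sketched below; `mask_pii_extended` and both patterns are illustrative assumptions and far from exhaustive PII detection.

```python
import re

# Illustrative patterns only; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r'\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b')
PHONE_RE = re.compile(r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b')

def mask_pii_extended(text):
    # Hypothetical helper: mask emails first, then 10-digit phone numbers
    # written with or without separators.
    text = EMAIL_RE.sub('[EMAIL]', text)
    text = PHONE_RE.sub('[PHONE]', text)
    return text

sample = "Call 9876543210 or 987-654-3210, or mail jane.doe@example.com."
masked = mask_pii_extended(sample)
print(masked)
```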

In [13]:
processed_docs = []
for doc in documents:
  text = clean_documents(doc.page_content)
  text = mask_pii(text)
  processed_docs.append({
      "text":text,
      "source":doc.metadata['source']
  })

**Chunking**

Split into smaller pieces -> Improve retrieval

In [14]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [15]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=80
)
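
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on separators such as paragraphs and spaces); the hypothetical `sliding_window_chunks` only shows how consecutive chunks share `chunk_overlap` characters.

```python
def sliding_window_chunks(text, chunk_size=400, chunk_overlap=80):
    # Hypothetical helper: fixed-width windows that advance by
    # chunk_size - chunk_overlap characters each step.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# 1000 characters of dummy text
text = "".join(str(i % 10) for i in range(1000))
pieces = sliding_window_chunks(text)
print([len(p) for p in pieces])
```

With the settings used in this notebook, each window starts 320 characters after the previous one, so the last 80 characters of one chunk reappear at the start of the next.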

In [16]:
chunks = []
for doc in processed_docs:
  split_texts = splitter.split_text(doc['text'])
  for chunk in split_texts:
    chunks.append({
        "text":chunk,
        "source":doc['source']
    })

In [17]:
chunks[0]

{'text': '1 Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as',
 'source': 'Faster R-CNN.pdf'}

**Embeddings**

Convert text -> vectors for similarity search

In [18]:
from sentence_transformers import SentenceTransformer

In [19]:
embed_model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [ch['text'] for ch in chunks]
embeddings = embed_model.encode(texts)


**Dense Retrieval (FAISS)**

Uses embeddings (vectors) to capture semantic meaning, not just exact words.

In [22]:
!pip install faiss-cpu



In [23]:
import faiss
import numpy as np

In [25]:
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))
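
`IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances. The pure-Python sketch below mimics that behaviour on toy 2-D vectors to show what the index computes; `flat_l2_search` is a hypothetical stand-in, not FAISS code.

```python
def flat_l2_search(vectors, query, k):
    # Hypothetical stand-in for an exact (brute-force) L2 search:
    # score every stored vector, then keep the k smallest distances.
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))  # squared L2
        scored.append((dist, idx))
    scored.sort()
    return scored[:k]

vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
nearest = flat_l2_search(vectors, query=[1.0, 1.0], k=2)
print(nearest)
```

Each tuple is `(squared_distance, row_index)`, which corresponds to the `D` and `I` arrays returned by `index.search`.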

**Sparse Retrieval**

Uses keyword matching (TF-IDF, BM25). Based on exact word overlap

In [29]:
!pip install rank_bm25



In [30]:
from rank_bm25 import BM25Okapi

In [31]:
tokenized_chunks = [c["text"].split() for c in chunks]
bm25 = BM25Okapi(tokenized_chunks)
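
Splitting on whitespace keeps case and punctuation, so a query token such as `ResNet?` will not match `ResNet` in a chunk. One possible fix is a normalizing tokenizer, sketched below; `tokenize` is a hypothetical helper and would have to be applied to both the chunks and the query.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: lowercase, then keep alphanumeric runs.
    # Hyphens inside a word are kept so "R-CNN" stays one token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

print("What is ResNet?".split())
print(tokenize("What is ResNet?"))
print(tokenize("Faster R-CNN."))
```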

**Hybrid Retrieval**

Combine Keyword (Sparse) + Semantic (Dense) Search

In [33]:
def hybrid_search(query, k = 5):
  query_embedding = embed_model.encode([query])

  # Dense: k nearest chunks by L2 distance, closest first
  D, I  = index.search(np.array(query_embedding), k)
  dense_idx = [int(i) for i in I[0] if i != -1]

  # Sparse: top-k BM25 scores, best first
  bm25_scores = bm25.get_scores(query.split())
  sparse_idx = [int(i) for i in np.argsort(bm25_scores)[::-1][:k]]

  # Combine, keeping order and dropping chunks returned by both retrievers
  combined_idx = list(dict.fromkeys(dense_idx + sparse_idx))
  return [chunks[i] for i in combined_idx]
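
The function above concatenates the two result lists without scoring them against each other. A common alternative is reciprocal rank fusion (RRF), which this notebook does not use; the sketch below shows the idea on hypothetical chunk ids, with `k=60` taken as an assumed default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: list of ranked lists of chunk ids, best first.
    # Each id earns 1 / (k + rank) for every list it appears in.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ids = [3, 1, 7]    # hypothetical dense (FAISS) result order
sparse_ids = [7, 3, 9]   # hypothetical sparse (BM25) result order
fused = reciprocal_rank_fusion([dense_ids, sparse_ids])
print(fused)
```

Chunks found by both retrievers (3 and 7 here) rise to the top, and RRF needs only ranks, so the L2 distances and BM25 scores never have to be put on a common scale.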

**Query Reformulation**

Improve bad queries -> better retrieval

In [34]:
def rewrite_query(query):
  return f"Explain clearly: {query}"

**Reranking**

Re-score retrieved results for better accuracy

In [35]:
from sentence_transformers import CrossEncoder

In [37]:
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, results):
  pairs = [(query, r['text']) for r in results]
  scores = reranker.predict(pairs)

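  # Sort on the score only; on a tie, comparing the dicts would raise TypeError
  ranked = sorted(zip(scores, results), key=lambda pair: pair[0], reverse=True)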
  return [r for _, r in ranked]


**Caching**

In [38]:
cache = {}

In [39]:
def cached_search(query):
  if query in cache:
    return cache[query]

  results = hybrid_search(query)
  cache[query] = results
  return results
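
The dictionary cache keys on the exact query string and grows without bound. A possible variant, sketched here with a stand-in search function so it runs on its own, normalizes the key and caps the cache size with `functools.lru_cache`; `fake_search` and `cached_search_normalized` are hypothetical names.

```python
from functools import lru_cache

calls = []

def fake_search(query):
    # Stand-in for hybrid_search so the sketch is self-contained.
    calls.append(query)
    return [f"result for {query}"]

@lru_cache(maxsize=256)
def _search_normalized(key):
    # Store an immutable tuple so callers cannot mutate the cached value.
    return tuple(fake_search(key))

def cached_search_normalized(query):
    # Collapse case and whitespace before the cache lookup.
    key = " ".join(query.lower().split())
    return list(_search_normalized(key))

first = cached_search_normalized("What is ResNet?")
second = cached_search_normalized("  what is   resnet? ")
print(len(calls))
```

Both calls map to the same key, so the underlying search runs once.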

**Secure Retrieval**

Restrict access based on user or context.

In [40]:
def secure_filter(results, allowed_sources):
  return [r for r in results if r['source'] in allowed_sources]
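
In practice the allow-list would usually depend on who is asking. A minimal sketch, assuming a hypothetical in-memory `USER_SOURCES` table (the user names are invented; the file names are the ones used in this notebook):

```python
# Hypothetical access-control table: user id -> sources that user may read.
USER_SOURCES = {
    "alice": {"ResNet.pdf", "Faster R-CNN.pdf"},
    "bob": {"Vision Transformer (ViT).pdf"},
}

def secure_filter_for_user(results, user_id):
    # Unknown users get an empty allow-list, so nothing is returned.
    allowed = USER_SOURCES.get(user_id, set())
    return [r for r in results if r["source"] in allowed]

results = [
    {"text": "residual blocks", "source": "ResNet.pdf"},
    {"text": "patch embeddings", "source": "Vision Transformer (ViT).pdf"},
]
alice_view = secure_filter_for_user(results, "alice")
unknown_view = secure_filter_for_user(results, "mallory")
print(alice_view)
print(unknown_view)
```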

**Multi-Hop Retrieval**

Multiple retrieval steps

In [41]:
def multi_hop(query):
  step1 = hybrid_search(query)

  refined_query = f"Based on the above, explain deeper: {query}"

  step2 = hybrid_search(refined_query)

  return step1 + step2
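
Note that the second hop above does not see the first hop's results; only the query wording changes. One way to make the hops depend on each other is sketched below with a toy retriever. `multi_hop_with_context` and `fake_search` are hypothetical, and feeding a text snippet into the next query is just one of several possible strategies.

```python
def multi_hop_with_context(query, search, top_n=1, max_chars=200):
    # `search` is any function returning a list of {"text", "source"} dicts,
    # for example hybrid_search.
    step1 = search(query)
    # Feed a snippet of the first-hop results into the second-hop query.
    hint = " ".join(r["text"][:max_chars] for r in step1[:top_n])
    step2 = search(f"{query} {hint}".strip())
    # Merge both hops, dropping chunks already seen.
    seen, merged = set(), []
    for r in step1 + step2:
        if r["text"] not in seen:
            seen.add(r["text"])
            merged.append(r)
    return merged

def fake_search(q):
    corpus = [
        {"text": "ResNet uses shortcut connections", "source": "ResNet.pdf"},
        {"text": "shortcut connections ease optimization", "source": "ResNet.pdf"},
    ]
    # Toy matcher: return chunks sharing at least one word with the query.
    words = set(q.lower().split())
    return [c for c in corpus if words & set(c["text"].lower().split())]

hops = multi_hop_with_context("ResNet", fake_search)
print([h["text"] for h in hops])
```

The second chunk never mentions "ResNet" and is reached only because the first hop introduced the words "shortcut connections".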

**Prompt + LLM**

Generate answer using retrieved context.

In [42]:
!pip install langchain-openai



In [43]:
import os
os.environ['OPENAI_API_KEY'] = 'Your-Openai-API-Key'

In [44]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name='gpt-4o-mini',
                 temperature = 0)

In [45]:
def generate_answer(query, context):
  prompt = f'''
  Answer only from context.
  If not found, say "I don't know".

  Context:
  {context}

  Question:
  {query}
  '''
  response = llm.invoke(prompt)

  return response

**Hallucination Control**

Force model to say "I don't know"

In [46]:
### Already added in prompt as "ONLY from context"

**Evaluation**

Check whether the generated answer contains the expected ground-truth text

In [47]:
def evaluation(answer, ground_truth):
  return ground_truth.lower() in answer.lower()
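
The function above compares the answer text with a ground-truth string. Measuring whether the right chunk was retrieved needs a separate metric. A minimal sketch of a source-level hit rate follows; `hit_rate_at_k`, the labelled `eval_set`, and the stand-in retriever are all assumptions for illustration.

```python
def hit_rate_at_k(eval_set, search, k=5):
    # eval_set: list of (query, expected_source) pairs.
    # A query counts as a hit if any of the top-k results comes from the
    # expected source.
    hits = 0
    for query, expected_source in eval_set:
        results = search(query)[:k]
        if any(r["source"] == expected_source for r in results):
            hits += 1
    return hits / len(eval_set) if eval_set else 0.0

def fake_search(query):
    # Stand-in retriever that always returns the same two chunks.
    return [{"text": "...", "source": "ResNet.pdf"},
            {"text": "...", "source": "Faster R-CNN.pdf"}]

eval_set = [("What is ResNet?", "ResNet.pdf"),
            ("What is ViT?", "Vision Transformer (ViT).pdf")]
score = hit_rate_at_k(eval_set, fake_search)
print(score)
```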

**Latency Tracking**

Measure performance based on retrieval speed

In [48]:
import time

In [52]:
start = time.time()
results = hybrid_search("What is ResNet?")
print(f"Latency: {time.time() - start}")

Latency: 0.02143120765686035
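
A single `time.time()` measurement can be noisy. The sketch below repeats the call and reports the median and the worst case using `time.perf_counter`; `measure_latency` is a hypothetical helper, and `sorted` stands in for `hybrid_search` so the example runs on its own.

```python
import statistics
import time

def measure_latency(fn, *args, repeats=20):
    # Time `fn` several times with a monotonic, high-resolution clock and
    # report the median and the maximum in seconds.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return {"median": statistics.median(timings), "max": max(timings)}

stats = measure_latency(sorted, list(range(1000)))
print(stats)
```

When timing retrieval in this notebook, it makes sense to pass `hybrid_search` rather than `cached_search`, since repeated calls to the latter only measure a dictionary lookup.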


**Feedback Loop**

Store judgments on answers so that failures can be reviewed and the system improved over time

In [53]:
feedback = []

def store_feedback(query, answer, correct):
  feedback.append({
      "query":query,
      "answer":answer,
      "correct":correct
  })
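
Stored feedback becomes useful once it is summarized. A minimal sketch, assuming records shaped like the ones `store_feedback` appends (the sample entries and `summarize_feedback` are invented for illustration):

```python
def summarize_feedback(feedback):
    # Overall accuracy plus the queries marked wrong, which are candidates
    # for prompt or retrieval fixes.
    total = len(feedback)
    correct = sum(1 for f in feedback if f["correct"])
    failed = [f["query"] for f in feedback if not f["correct"]]
    accuracy = correct / total if total else 0.0
    return {"accuracy": accuracy, "failed_queries": failed}

sample_feedback = [
    {"query": "What is ResNet?", "answer": "A deep residual network", "correct": True},
    {"query": "What is ViT?", "answer": "I don't know", "correct": False},
]
summary = summarize_feedback(sample_feedback)
print(summary)
```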

**Bias Check**

Detect unfair outputs

In [54]:
def bias_check(answer):
  # Naive keyword heuristic: flag answers that contain both "only" and "better"
  if "only" in answer and "better" in answer:
    return "Check bias"
  return None

### Final Pipeline

In [55]:
def full_pipeline(query):
  query = rewrite_query(query)

  results = cached_search(query)

  results = secure_filter(results, ["ResNet.pdf", "Faster R-CNN.pdf","Vision Transformer (ViT).pdf"])

  results = rerank(query, results)

  context = "\n\n".join(r["text"] for r in results[:3])

  answer = generate_answer(query, context)

  return answer

In [56]:
full_pipeline("What problem does ResNet solve?")

AIMessage(content='ResNet solves the optimization difficulty associated with training deep neural networks. It eases the optimization process by providing faster convergence at the early stages of training, allowing for better performance and accuracy gains as the depth of the network increases.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 45, 'prompt_tokens': 363, 'total_tokens': 408, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_084a28d6e8', 'id': 'chatcmpl-DAzGxjcFEsaVTnuOhrNLuT2hcVOvq', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='lc_run--019c764b-73d7-7e93-b650-3888fc5fa687-0', tool_calls=[], invalid_tool_calls=[], usage_metadata={'input_tokens': 363, '

In [57]:
full_pipeline("How does Faster R-CNN improve object detection?")

AIMessage(content='Faster R-CNN improves object detection by integrating two modules into a single, unified network. The first module is a deep fully convolutional network that proposes regions where objects may be located, while the second module is the Fast R-CNN detector that utilizes these proposed regions for detection. This approach streamlines the process and reduces the running time of detection networks, making it more efficient compared to previous methods that relied on separate region proposal algorithms.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 88, 'prompt_tokens': 304, 'total_tokens': 392, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_provider': 'openai', 'model_name': 'gpt-4o-mini-2024-07-18', 'system_fingerprint': 'fp_373a14eb6f', 'id': 'chatcmpl-DAzI23

In [58]:
full_pipeline("Difference between CNN and Vision Transformer?")

AIMessage(content='The main difference between CNNs (Convolutional Neural Networks) and Vision Transformers (ViT) lies in their inductive biases and architectural components. CNNs incorporate strong image-specific inductive biases such as locality, two-dimensional neighborhood structure, and translation equivariance throughout their layers. This means that CNNs are designed to recognize patterns in local regions of images and maintain spatial hierarchies.\n\nIn contrast, Vision Transformers have much less image-specific inductive bias. In ViTs, only the MLP (Multi-Layer Perceptron) layers are local and translationally equivariant, while the self-attention layers are global, allowing them to consider the entire image at once. Additionally, ViTs use the two-dimensional neighborhood structure sparingly, focusing instead on processing sequences of image patches directly. This allows ViTs to perform well on image classification tasks without relying on the traditional structure of CNNs.', a