## 📦 Install Dependencies

In [1]:
!pip install faiss-cpu sentence-transformers transformers PyPDF2


## 📄 Load and Clean PDF

In [16]:

import re
from PyPDF2 import PdfReader

reader = PdfReader("white_paper.pdf")
raw_text = ""
for page in reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text + " "

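# Clean: collapse every run of whitespace (including newlines) into one space
cleaned_text = re.sub(r'\s+', ' ', raw_text)
# Rejoin words hyphenated across line breaks, e.g. "compre- hensive" -> "comprehensive"
cleaned_text = re.sub(r'(?<=\w)- (?=\w)', '', cleaned_text)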
print(f"✅ Length of cleaned text: {len(cleaned_text)} characters")


✅ Length of cleaned text: 44338 characters
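
To see what the two cleaning substitutions do, the sketch below applies the same regular expressions to a short made-up string. The function name `clean_text` and the sample text are assumptions for illustration only. Note one side effect: a genuine compound that happens to break at a line end (here `high-` / `quality`) also loses its hyphen.

```python
import re

def clean_text(raw):
    """Apply the same two substitutions as the cell above (hypothetical helper)."""
    # Collapse whitespace runs, including newlines, into single spaces
    text = re.sub(r'\s+', ' ', raw)
    # Remove "- " when it sits between two word characters (line-break hyphenation)
    return re.sub(r'(?<=\w)- (?=\w)', '', text)

# Made-up sample, not taken from the PDF
sample = "high-\nquality  primary   care is compre-\nhensive; cost - benefit"
cleaned = clean_text(sample)
print(cleaned)
# → highquality primary care is comprehensive; cost - benefit
```

A dash with spaces on both sides (`cost - benefit`) is left alone, because the lookbehind requires a word character directly before the hyphen.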


## ✂️ Split Document into Chunks (~300 words)

In [17]:
context = cleaned_text  # use full cleaned text from PDF
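
The cell above passes the whole document on as a single context rather than splitting it. For reference, a minimal sketch of the ~300-word split named in the heading could look like the following. The function name `split_into_chunks`, the 50-word overlap, and the synthetic sample are assumptions, not part of the original notebook.

```python
def split_into_chunks(text, chunk_size=300, overlap=50):
    """Split text into chunks of about chunk_size words; neighbours share overlap words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# Synthetic 700-word sample so the sketch runs without the PDF
sample = " ".join(f"w{i}" for i in range(700))
chunks = split_into_chunks(sample)
print(len(chunks), [len(c.split()) for c in chunks])
# → 3 [300, 300, 200]
```

In the notebook this would be called as `split_into_chunks(cleaned_text)`. The overlap is one common way to avoid cutting an answer in half at a chunk boundary.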


## 🧠 Encode Chunks and Save FAISS

## 📡 Load QA and CrossEncoder

In [15]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline

model_name = "valhalla/longformer-base-4096-finetuned-squadv1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)


Some weights of the model checkpoint at valhalla/longformer-base-4096-finetuned-squadv1 were not used when initializing LongformerForQuestionAnswering: ['longformer.pooler.dense.bias', 'longformer.pooler.dense.weight']
- This IS expected if you are initializing LongformerForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LongformerForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


## 🧠 Question Answering with Reranked Context Fusion

In [19]:
question = "What is high-quality primary care?"
result = qa_pipeline(question=question, context=context)

print(f"🔹 Question: {question}")
print(f"🧠 Answer: {result['answer']}")
print(f"📊 Confidence Score: {round(result['score'], 3)}")


Input ids are automatically padded to be a multiple of `config.attention_window`: 512


🔹 Question: What is high-quality primary care?
🧠 Answer: comprehensive
📊 Confidence Score: 0.979


In [20]:
questions = [
    "What is high-quality primary care?",
    "What are the essential elements of primary care?",
    "What role does the government play in strengthening it?"
]

for q in questions:
    result = qa_pipeline(question=q, context=context)
    print(f"\n🔹 Question: {q}")
    print(f"🧠 Answer: {result['answer']}")
    print(f"📊 Score: {round(result['score'], 3)}")



🔹 Question: What is high-quality primary care?
🧠 Answer: comprehensive
📊 Score: 0.979

🔹 Question: What are the essential elements of primary care?
🧠 Answer: survey, health process and outcome metrics
📊 Score: 0.83

🔹 Question: What role does the government play in strengthening it?
🧠 Answer: Primary Care
📊 Score: 0.635
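
The runs above query the full document in one call and return very short spans. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to query each chunk separately and rank the candidate answers by score. The helper `best_answers_over_chunks` and the stand-in `fake_qa` are hypothetical. `fake_qa` only mimics the `answer` and `score` keys of the pipeline result so the sketch runs without downloading a model.

```python
def best_answers_over_chunks(question, chunks, qa_fn, top_k=3):
    """Call qa_fn once per chunk and return the top_k candidates by score."""
    candidates = []
    for idx, chunk in enumerate(chunks):
        out = qa_fn(question=question, context=chunk)
        candidates.append({"chunk": idx, "answer": out["answer"], "score": out["score"]})
    # Highest score first; ties keep their original chunk order
    candidates.sort(key=lambda c: c["score"], reverse=True)
    return candidates[:top_k]

def fake_qa(question, context):
    """Stand-in for qa_pipeline: first word as the answer, a toy length-based score."""
    words = context.split()
    return {"answer": words[0], "score": len(words) / 10}

ranked = best_answers_over_chunks("q?", ["a b c", "d e f g h", "i"], fake_qa, top_k=2)
print(ranked)
# → [{'chunk': 1, 'answer': 'd', 'score': 0.5}, {'chunk': 0, 'answer': 'a', 'score': 0.3}]
```

With the real model, `qa_fn` would be `qa_pipeline` and `chunks` the output of a splitting step. Scores from separate chunks are not guaranteed to be directly comparable, so the ranking should be treated as a heuristic.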


## 🧪 Evaluation Set