# Step 2: Extract Data from PDF
In this step we extract the text from the PDF so that it can be split into chunks and indexed in a vector store.


## Prerequisites

Please [download](https://www.researchgate.net/profile/Gary-Clark/publication/19364043_Slamon_DJ_Clark_GM_Wong_SG_Levin_WJ_Ullrich_A_McGuire_WLHuman_breast_cancer_correlation_of_relapse_and_survival_with_amplification_of_the_HER-2neu_oncogene_Science_Wash_DC_235_177-182/links/0046352b85f241a532000000/Slamon-DJ-Clark-GM-Wong-SG-Levin-WJ-Ullrich-A-McGuire-WLHuman-breast-cancer-correlation-of-relapse-and-survival-with-amplification-of-the-HER-2-neu-oncogene-Science-Wash-DC-235-177-182.pdf) the paper and place the PDF, with its name unchanged
(`SlamonetalSCIENCE1987.pdf`), in the `data/` directory at this project's root.
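
Before loading, it can help to fail early if the file is not where the notebook expects it. The check below is a minimal sketch; `check_pdf` is a hypothetical helper (not part of this project), and it assumes the notebook is run from the project root.

```python
from pathlib import Path


def check_pdf(data_dir: Path, fname: str = "SlamonetalSCIENCE1987.pdf") -> Path:
    """Return the PDF path, or raise a clear error if the file is missing."""
    fpath = data_dir / fname
    if not fpath.is_file():
        raise FileNotFoundError(
            f"Expected the PDF at {fpath}; download it and place it there."
        )
    return fpath
```

Calling `check_pdf(Path.cwd() / "data")` at the top of the notebook would surface a misplaced or renamed file before any loader runs.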


# Load
After trying a few different off-the-shelf loaders, the one that looked the most
reliable was `PyMuPDFLoader`.

In [None]:
from pathlib import Path

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader

dpath = Path.cwd() / "data"
fname = "SlamonetalSCIENCE1987.pdf"
fpath = dpath / fname

# PyPDFLoader was one of the alternatives tried; PyMuPDFLoader looked more reliable.
# Passing the path as a string keeps this working on loader versions that expect str.
loader = PyMuPDFLoader(str(fpath))

# One Document per PDF page, with the page index stored in metadata["page"].
documents = loader.load()

**Note**: When opening and viewing the PDF, you'll notice that the content of
the article sits entirely between the 3rd page (index 2) and the second-to-last
page (index -2), inclusive:
1. The first page (index 0) holds just the title, authors, and some
metadata that's not relevant to the context we need.
1. The second page (index 1) just has some logos, the DOI, and links to related
articles that are not part of the article's content.
1. The last page (index -1) mostly belongs to another article; the only part
from the main article is the continuation of its `References and Notes` section.

For this reason, we drop the first two pages and the last page and keep only
the pages in between.

In [9]:
n_pages = len(documents)
content = [doc for doc in documents if doc.metadata["page"] not in [0, 1, n_pages - 1]]
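
To make the index arithmetic explicit, here is a small sketch of the same filter on stand-in data. `keep_article_pages` is a hypothetical helper, the page dictionaries only mimic the `metadata["page"]` field, and the 10-page count is arbitrary, not the real length of the PDF.

```python
def keep_article_pages(docs, n_front=2, n_back=1):
    """Drop the first n_front and last n_back pages, keyed on metadata['page']."""
    n_pages = len(docs)
    skip = set(range(n_front)) | set(range(n_pages - n_back, n_pages))
    return [doc for doc in docs if doc["metadata"]["page"] not in skip]


# Stand-in for a 10-page PDF; real loader output would be Document objects.
fake_docs = [{"metadata": {"page": i}, "page_content": f"page {i}"} for i in range(10)]
kept = keep_article_pages(fake_docs)
print([d["metadata"]["page"] for d in kept])  # [2, 3, 4, 5, 6, 7, 8]
```

Filtering on the stored page index rather than slicing the list keeps the result correct even if the documents are reordered.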

# Split Documents

Because of performance and memory constraints, the documents are split into
chunks. Smaller chunks let the retriever quickly pull out just the snippets
that best match a query.

In [None]:
splitter = CharacterTextSplitter(chunk_size=1_024, chunk_overlap=32, separator="\n")

split = splitter.split_documents(content)
split[0]
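
To build intuition for `chunk_size` and `chunk_overlap`, the sketch below uses a plain sliding window over characters. This is a simplification written for illustration, not LangChain's algorithm: `CharacterTextSplitter` first splits on the separator and then merges the pieces, so its chunk boundaries will differ. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text, chunk_size=1024, chunk_overlap=32):
    """Cut text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window always adds new characters.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 32-character overlap plays above: text that straddles a boundary still appears whole in at least one chunk.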

# Construct Vector Store
Create a vector store from the document chunks and their embeddings. For
simplicity, let's use the same base model for the embeddings as for the chatbot.

First we construct the vector store from the document chunks and the embedding
model. Then we save it locally (TODO: Create database to host vector store).

In [None]:
from pathlib import Path
from langchain_community.vectorstores import FAISS
from langchain_community.llms import Ollama
from langchain_ollama import OllamaEmbeddings
from dotenv import load_dotenv

load_dotenv()


VECTORSTORE_PATH = Path("data") / "vectorstore" / "db_pdf_context"
MODEL_NAME = "deepseek-r1:latest"
OLLAMA_SERVER_URL = "http://localhost:11434"


EMBEDDINGS = OllamaEmbeddings(model=MODEL_NAME, base_url=OLLAMA_SERVER_URL)
# LLM = Ollama(model=MODEL_NAME, base_url=OLLAMA_SERVER_URL)
vectorstore = FAISS.from_documents(split, EMBEDDINGS)
vectorstore.save_local(VECTORSTORE_PATH)

Let's give it a query and see what the vector store retrieves.

In [13]:
retriever = vectorstore.as_retriever()

retrieved_documents = retriever.invoke("What is LangChain?")

retrieved_documents[0].page_content

'HER-2/neu amplification with various disease parameters were performed by the x2\ntest. P values werc computed after combining the 5 to 20 and >20 cases, since there\n-\nwere so few samples in the >20 group.\ni8o\n on January 15, 2007 \nwww.sciencemag.org\nDownloaded from \nBIOM 255 (Leffert) – Discussion Feb. 1, 2007'

Not super relevant, but asking irrelevant questions could be useful in 
evaluating our chatbot!
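
The reason an off-topic question still returns a chunk is that similarity search ranks chunks by closeness to the query and returns the top results, whether or not any of them is a good match. The sketch below illustrates this with made-up 2-D vectors and cosine similarity; both are assumptions for illustration, and the metric the FAISS index actually uses depends on how it is configured.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=1):
    """Rank chunk ids by similarity to the query and return the best k with scores."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunk_vecs.items()]
    return sorted(scored, reverse=True)[:k]


# Toy 2-D embeddings, made up for illustration only.
toy_chunks = {"statistics": [1.0, 0.1], "survival": [0.9, 0.4]}
off_topic_query = [-0.2, 1.0]

best = top_k(off_topic_query, toy_chunks)
print(best)  # a chunk is still returned, just with a low similarity score
```

A similarity score threshold would be one way to let the retriever return nothing for questions like this.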

# Conclusions

If there were more time, it would be good to do some text cleaning, since the
text extraction (OCR) step introduced some issues such as misspellings and
typos. There are also extra spaces, so some individual tokens appear as multiple
words. If this is common, and it affects key words, then garbage in becomes garbage out.
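
As a starting point, a cleaning pass could drop the download watermark lines and collapse repeated whitespace. The sketch below is an assumption-laden example: the patterns are guesses based only on the single retrieved chunk shown earlier, `clean_chunk` is a hypothetical helper, and it does not attempt to fix misspellings or tokens that were split by stray spaces.

```python
import re

# Lines that look like download watermarks rather than article text.
# These patterns are guesses based on the one retrieved chunk shown earlier.
WATERMARK_PATTERNS = [
    re.compile(r"^\s*on \w+ \d{1,2}, \d{4}\s*$"),
    re.compile(r"^\s*www\.sciencemag\.org\s*$"),
    re.compile(r"^\s*Downloaded from\s*$"),
]


def clean_chunk(text):
    """Remove watermark lines, collapse whitespace runs, and drop empty lines."""
    kept = []
    for line in text.split("\n"):
        if any(p.match(line) for p in WATERMARK_PATTERNS):
            continue
        line = re.sub(r"\s+", " ", line).strip()
        if line:
            kept.append(line)
    return "\n".join(kept)


sample = (
    "were so few samples in the >20 group.\ni8o\n on January 15, 2007 \n"
    "www.sciencemag.org\nDownloaded from \nBIOM 255"
)
print(clean_chunk(sample))
```

Running this over each chunk's `page_content` before building the vector store would keep the watermark text out of the embeddings.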

Luckily, the chatbot's responses look good, so revisiting this is low priority.

Now that we understand the vector store, let's return to the chatbot app and 
ask some questions!