# 📘 PDF Embedding & Semantic Search Project

This notebook demonstrates how to:
1. Load PDF documents
2. Split text into chunks
3. Generate embeddings
4. Store vectors in Pinecone
5. Perform semantic search queries

---

## 🔧 1. Install and Import Dependencies

This section installs and imports the required libraries:
- `langchain-community` and `pypdf` for PDF loading, and `langchain-text-splitters` for token-based splitting
- `openai` and `langchain-openai` for embeddings
- `pinecone` for the vector database
- Utility libraries (`uuid`, `time`)


In [1]:
%pip install -qU pinecone openai langchain-openai langchain-community langchain-text-splitters pypdf

Note: you may need to restart the kernel to use updated packages.


In [2]:
from pinecone import Pinecone, ServerlessSpec
from openai import OpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import TokenTextSplitter
import time  # needed for time.sleep() after index creation
import uuid
from config import OPENAI_API_KEY, PINECONE_API_KEY
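
The import cell expects a local `config.py` that exposes `OPENAI_API_KEY` and `PINECONE_API_KEY`; the notebook does not show that file. A minimal sketch of one possible version, assuming the keys are supplied through environment variables of the same names (the helper `read_key` is hypothetical):

```python
# config.py (hypothetical layout; the original file is not shown in the notebook)
import os


def read_key(name, environ=None):
    """Return the API key stored under `name`, or raise a clear error."""
    environ = os.environ if environ is None else environ
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# In config.py these two lines would define the names the notebook imports:
# OPENAI_API_KEY = read_key("OPENAI_API_KEY")
# PINECONE_API_KEY = read_key("PINECONE_API_KEY")
```

Keeping keys out of the notebook itself avoids committing them by accident when the notebook is shared.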

## 🗄️ 2. Initialize Pinecone
- Set the Pinecone API key and create a client instance.  
- Define the index name and check whether it already exists. If not, create a new index with the correct `dimension` (must match the embedding model output).  
- Connect to the index so that later embeddings can be uploaded and searched.  
- For more details, see the [Pinecone documentation](https://docs.pinecone.io/).


In [11]:
pc = Pinecone(api_key = PINECONE_API_KEY)

index_name = "your-index"

if not any(i["name"] == index_name for i in pc.list_indexes().get("indexes", [])):
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
    time.sleep(5)

index = pc.Index(index_name)

print("="*30)
print("Pinecone initialization: OK")
print("="*30)
print()

Pinecone initialization: OK
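
The fixed `time.sleep(5)` above assumes a new index becomes ready within five seconds. A more defensive pattern is to poll until the index reports that it is ready. The sketch below is a generic polling helper (the name `wait_until` and its parameters are assumptions, not part of the Pinecone client); with Pinecone, the predicate would typically wrap `pc.describe_index(index_name).status["ready"]`.

```python
import time


def wait_until(predicate, timeout=60.0, interval=1.0, sleep=time.sleep):
    """Call `predicate` until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Assumed usage with the client created above:
# wait_until(lambda: pc.describe_index(index_name).status["ready"])
```

The function returns `False` on timeout rather than raising, so the caller decides whether a slow index is an error.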



## 🧠 3. Initialize Embedding Model
- Configure the OpenAI API key.  
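- Choose the embedding model (`text-embedding-3-large` or `text-embedding-3-small`). The index above uses `dimension=1536`, which matches `text-embedding-3-small`; `text-embedding-3-large` returns 3072-dimensional vectors by default and would need a different index dimension.
- Wrap the model with `OpenAIEmbeddings` for use in LangChain.  
- See the [OpenAI embeddings guide](https://platform.openai.com/docs/guides/embeddings) for more details.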
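
A small sanity check can catch a dimension mismatch between the chosen model and the index before any upload. The sketch below is illustrative only: the lookup table and the function name `check_dimension` are assumptions, and the sizes are the default output dimensions of the two models named above.

```python
# Default output sizes of the two models named above (per OpenAI's embeddings guide).
DEFAULT_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_dimension(model_name, index_dimension):
    """Raise ValueError if the model's default size differs from the index."""
    model_dimension = DEFAULT_DIMENSIONS[model_name]
    if model_dimension != index_dimension:
        raise ValueError(
            f"{model_name} returns {model_dimension}-d vectors, "
            f"but the index expects {index_dimension}"
        )
    return model_dimension


check_dimension("text-embedding-3-small", 1536)  # passes silently
```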


In [4]:
embedding_model = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=OPENAI_API_KEY)

print("="*30)
print("OpenAI initialization: OK")
print("="*30)
print()

OpenAI initialization: OK



## 🛠️ 3.5 Helper Functions
- Define small reusable helpers to streamline the workflow. Here, `print_match(result)` pretty-prints the score, source file, page number, and text of each query match.


In [5]:
def print_match(result):
    for match in result['matches']:
        print("="*60)
        print(f"Score: {match['score']:.2f} \t Source: {match['metadata']['source']} \t Page: {int(match['metadata']['page'])}")
        print("="*60)
        print(f"{match['metadata']['text']}")
        print("="*60)
        print()

## 📂 4. Load, Chunk, and Upload PDF Data

### Step 1: Load PDFs

`PyPDFLoader` loads the PDFs, and the [Tiktoken Splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token#tiktoken) splits each page into chunks by token count. Tiktoken is OpenAI's tokenizer, so its token counts are a closer fit for OpenAI models than a character count would be.

- `chunk_size=60`: each chunk holds up to 60 tokens, enough to carry a reasonable amount of context. Of the sizes tested (`40`, `50`, `60`, and `100`), `60` gave a reasonable balance between context size and search resolution.
- `chunk_overlap=15`: most papers use a two-column layout, which produces more cut-off and broken text during extraction. An overlap of 1/4 of the chunk size mitigates the effect.
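
With `chunk_size=60` and `chunk_overlap=15`, each new chunk starts 45 tokens after the previous one. The sketch below illustrates that sliding window on a plain list of token ids. It is a simplified stand-in written for illustration, not the `TokenTextSplitter` source, and `sliding_windows` is a hypothetical name.

```python
def sliding_windows(tokens, chunk_size=60, chunk_overlap=15):
    """Split `tokens` into overlapping windows of at most `chunk_size` items."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    stride = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return windows


token_ids = list(range(150))  # stand-in for a tokenized page
chunks = sliding_windows(token_ids)
print([(c[0], c[-1]) for c in chunks])  # [(0, 59), (45, 104), (90, 149)]
```

The last 15 tokens of each window reappear at the start of the next, so a sentence cut at a chunk boundary is still seen whole in at least one chunk when it is short enough.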

### Step 2: Embedding

Each chunk is then embedded into a 1536-dimensional vector using the OpenAI embedding model configured above. To be uploaded to Pinecone, the vectors have to be in [this format](https://docs.pinecone.io/docs/python-client#indexupsert):

```
upsert_response = index.upsert(
   vectors=[
       {'id': "vec1", "values":[0.1, 0.2, 0.3, 0.4], "metadata": {'genre': 'drama'}},
       {'id': "vec2", "values":[0.2, 0.3, 0.4, 0.5], "metadata": {'genre': 'action'}},
   ],
   namespace='example-namespace'
)
```

- Here, `str(uuid.uuid4())` is used as the `id` instead of a stringified incrementing integer, because the number of vectors is large: each PDF yields several hundred vectors (approximately 500 for the largest one).
- The original text, file name, and page number of each chunk are stored as metadata.
  - For demo purposes, only this information is stored. More fields, such as paper title, publication date, and authors, can be stored as metadata.
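
One side effect of random `uuid4` IDs is that re-running the upload cell stores a second copy of every chunk, because each run generates fresh IDs. If that matters, a content-derived ID is one alternative: the same chunk always maps to the same ID, so a repeated upsert overwrites the earlier vector instead of adding a new one. The sketch below is an illustration under that assumption; `make_chunk_id` and `make_record` are hypothetical helpers, not part of the notebook or of the Pinecone client.

```python
import hashlib


def make_chunk_id(source, page, text):
    """Derive a stable ID from the chunk's source file, page, and text."""
    key = f"{source}|{page}|{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:32]


def make_record(vector, text, source, page):
    """Build one record in the id / values / metadata layout shown above."""
    return {
        "id": make_chunk_id(source, page, text),
        "values": vector,
        "metadata": {"text": text, "source": source, "page": page},
    }


record = make_record([0.1, 0.2, 0.3], "example chunk", "Data/Research_01.pdf", 1)
```

The trade-off is that two chunks with identical text on the same page of the same file collapse into a single record.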

### Step 3: Batch Upload

For efficiency, vectors are uploaded to Pinecone in batches. The batch size used here is `32`.
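
The batching step amounts to slicing the record list in fixed-size steps. A minimal sketch with a hypothetical `batched` generator follows. Python slicing already clamps at the end of the list, so the final short batch needs no special case.

```python
def batched(items, batch_size=32):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


records = list(range(340))  # stand-in for 340 upsert records
sizes = [len(batch) for batch in batched(records)]
print(len(sizes), sizes[-1])  # 11 20
```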


In [6]:
PAPER_LIST = ['Data/Research_01.pdf','Data/Research_02.pdf','Data/Research_03.pdf']

In [7]:
# Parameters for the processing loop:
CHUNK_SIZE = 60
CHUNK_OVERLAP = 15
UPLOAD_BATCH_SIZE = 32

# Loop
for file_path in PAPER_LIST:
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split()
    print(f"Processing [{file_path}]")
    print(f"Pages shape: {len(pages)}")

    text_splitter = TokenTextSplitter(
        chunk_size=CHUNK_SIZE, 
        chunk_overlap=CHUNK_OVERLAP
    )

    source = pages[0].metadata["source"]

    total_sentences = []
    page_number_list = []
    for page in pages:
        page_num = page.metadata["page"] + 1
        sentences = text_splitter.split_text(page.page_content)
        total_sentences += sentences
        page_number_list += [page_num] * len(sentences)

    # Embed all chunks of a paper in one call to stay within OpenAI API rate limits
    paper_embedding = embedding_model.embed_documents(total_sentences)

    # Reformat the vectors
    to_upsert = []
    for i, sentence_vector in enumerate(paper_embedding):
        to_upsert.append({
            "id": str(uuid.uuid4()),
            "values": sentence_vector,
            "metadata": {
                            "text": total_sentences[i],
                            "source": source,
                            "page": page_number_list[i]
                        }
        })

    # Upload the vectors in batches
    batch_size = UPLOAD_BATCH_SIZE
    n = len(to_upsert)
    print(f"Total number: {n}")

    for i in range(0, n, batch_size):
        # Slicing clamps at the end of the list, so the last batch may be shorter
        batch = to_upsert[i: i+batch_size]

        index.upsert(vectors=batch,namespace="customer_01")
        print(f"Uploaded batch [{i} : {min(n, i+batch_size)}]")

Processing [Data/Research_01.pdf]
Pages shape: 22
Total number: 340
Uploaded batch [0 : 32]
Uploaded batch [32 : 64]
Uploaded batch [64 : 96]
Uploaded batch [96 : 128]
Uploaded batch [128 : 160]
Uploaded batch [160 : 192]
Uploaded batch [192 : 224]
Uploaded batch [224 : 256]
Uploaded batch [256 : 288]
Uploaded batch [288 : 320]
Uploaded batch [320 : 340]
Processing [Data/Research_02.pdf]
Pages shape: 22
Total number: 344
Uploaded batch [0 : 32]
Uploaded batch [32 : 64]
Uploaded batch [64 : 96]
Uploaded batch [96 : 128]
Uploaded batch [128 : 160]
Uploaded batch [160 : 192]
Uploaded batch [192 : 224]
Uploaded batch [224 : 256]
Uploaded batch [256 : 288]
Uploaded batch [288 : 320]
Uploaded batch [320 : 344]
Processing [Data/Research_03.pdf]
Pages shape: 27
Total number: 496
Uploaded batch [0 : 32]
Uploaded batch [32 : 64]
Uploaded batch [64 : 96]
Uploaded batch [96 : 128]
Uploaded batch [128 : 160]
Uploaded batch [160 : 192]
Uploaded batch [192 : 224]
Uploaded batch [224 : 256]
Uploaded b

## ✅ 5. Testing
- Define test queries to check whether the system retrieves relevant chunks.
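
Because the index was created with `metric="cosine"`, the `Score` printed for each match is the cosine similarity between the query vector and the stored vector, and values closer to 1 mean the two vectors point in more similar directions. A plain-Python sketch of the formula, for intuition only (Pinecone computes the score server-side):

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(round(cosine_similarity([1.0, 0.0], [1.0, 1.0]), 3))  # 0.707
```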

### Test 1

In [8]:
query = "What are the most common anatomical types of ACHD in adults, and their typical management strategies?"
print("="*30)
print(f"Test 1: {query}")
print("="*30)

query_embedding = embedding_model.embed_documents([query])[0]

res = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True,
    namespace="customer_01"
)

print_match(res)

Test 1: What are the most common anatomical types of ACHD in adults, and their typical management strategies?
Score: 0.69 	 Source: Data/Research_02.pdf 	 Page: 2
discovered in adults, issues related to selected ACHD groups 
with previous palliative or corrective surgeries and special 
considerations in ACHD patient population. A concise 
discussion of diseases is beyond the scope of this text and the 
readers are directed to

Score: 0.69 	 Source: Data/Research_02.pdf 	 Page: 2
discovered in adults, issues related to selected ACHD groups 
with previous palliative or corrective surgeries and special 
considerations in ACHD patient population. A concise 
discussion of diseases is beyond the scope of this text and the 
readers are directed to

Score: 0.68 	 Source: Data/Research_01.pdf 	 Page: 2
. 
Complications in Adulthood 
The ACHD population represents a diverse popula -
tion in terms of severity of CHD, history of surgical/  
catheter-based interventions, and socioeconomic status. 


### Test 2

In [9]:
query = "How is the transition from pediatric to adult congenital heart disease care typically managed?"
print("="*30)
print(f"Test 2: {query}")
print("="*30)

query_embedding = embedding_model.embed_documents([query])[0]

res = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True,
    namespace="customer_01"
)

print_match(res)

Test 2: How is the transition from pediatric to adult congenital heart disease care typically managed?
Score: 0.74 	 Source: Data/Research_01.pdf 	 Page: 11
comes of transition from pediatric to adult health care services for 
young people with congenital heart disease: a systematic review. 
Congenit Heart Dis. 2015;10413-427.
35. Reid GJ, Irvine MJ, McCrindle BW, et al. Preval

Score: 0.74 	 Source: Data/Research_01.pdf 	 Page: 11
comes of transition from pediatric to adult health care services for 
young people with congenital heart disease: a systematic review. 
Congenit Heart Dis. 2015;10413-427.
35. Reid GJ, Irvine MJ, McCrindle BW, et al. Preval

Score: 0.74 	 Source: Data/Research_01.pdf 	 Page: 1
ACT
Objective: To review the management of patients with 
congenital heart disease (CHD) transitioning from 
pediatric to adult care.
Methods: Review of the literature.
Results: Persons with CHD require close monitoring and 
evaluation throughout life to address



### Test 3

In [10]:
query = "Which ACHD lesions are associated with aortopathy or risk of heart failure in adulthood?"
print("="*30)
print(f"Test 3: {query}")
print("="*30)

query_embedding = embedding_model.embed_documents([query])[0]

res = index.query(
    vector=query_embedding,
    top_k=3,
    include_metadata=True,
    namespace="customer_01"
)

print_match(res)

Test 3: Which ACHD lesions are associated with aortopathy or risk of heart failure in adulthood?
Score: 0.63 	 Source: Data/Research_01.pdf 	 Page: 2
 sudden cardiac death (19%), 
which occur at mean ages of 48 years and 39 years, re-
spectively.
10 The form of heart failure in ACHD patients is 
related to subsystemic right ventricle (RV) dysfunction, 
coronary

Score: 0.63 	 Source: Data/Research_01.pdf 	 Page: 2
 sudden cardiac death (19%), 
which occur at mean ages of 48 years and 39 years, re-
spectively.
10 The form of heart failure in ACHD patients is 
related to subsystemic right ventricle (RV) dysfunction, 
coronary

Score: 0.62 	 Source: Data/Research_01.pdf 	 Page: 2
. 
Complications in Adulthood 
The ACHD population represents a diverse popula -
tion in terms of severity of CHD, history of surgical/  
catheter-based interventions, and socioeconomic status. 
However, a unifying clinical concern for these
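
The three test cells repeat the same embed-then-query steps, and in each result the first two matches carry identical text. A small wrapper could remove both the repetition and the duplicate hits. The sketch below is hypothetical: `search` and `dedupe_matches` are not part of the notebook, and it assumes that the embedder exposes `embed_query` (as LangChain embedding classes do) and that the query result supports `result["matches"]`, as `print_match` already assumes.

```python
def dedupe_matches(matches):
    """Keep the first match for each distinct (source, page, text) triple."""
    seen = set()
    unique = []
    for match in matches:
        meta = match["metadata"]
        key = (meta["source"], meta["page"], meta["text"])
        if key not in seen:
            seen.add(key)
            unique.append(match)
    return unique


def search(query, embedder, index, top_k=3, namespace="customer_01"):
    """Embed `query`, query the index, and return de-duplicated matches."""
    vector = embedder.embed_query(query)
    result = index.query(
        vector=vector,
        top_k=top_k,
        include_metadata=True,
        namespace=namespace,
    )
    return dedupe_matches(result["matches"])


# Assumed usage with the objects defined earlier:
# matches = search(query, embedding_model, index)
```

De-duplicating after the query means fewer than `top_k` matches may come back; requesting a larger `top_k` compensates for that.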

