# Upload Papers to Pinecone Vector Storage

## Install Packages

If you are using macOS, use `pip3` instead of `pip`.

`-qU` combines `-q` (`--quiet`) and `-U` (`--upgrade`).

In [1]:
!pip install -qU \
    langchain==0.0.276 \
    openai==0.27.10 \
    tiktoken==0.4.0 \
    pinecone-client==2.2.2 \
    wikipedia==1.4.0 \
    pypdf==3.15.4

## Import Packages

In [None]:
from langchain.embeddings import OpenAIEmbeddings   
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import TokenTextSplitter

import pinecone
import time
import uuid

from config import (
    OPENAI_API_KEY,
    PINECONE_API_KEY,
    PINECONE_ENVIRONMENT,
    PINECONE_INDEX_NAME,
    EMBEDDING_MODEL,
    SPLITTER_CHUNK_SIZE,
    SPLITTER_CHUNK_OVERLAP,
    UPLOAD_BATCH_SIZE,
)
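The `config` module imported above is not shown in this notebook. As a rough sketch, such a `config.py` could look like the following, assuming the secrets are read from environment variables. The environment variable names and the fallback index name are hypothetical; the embedding model, chunk size, chunk overlap, and batch size are the defaults described in this notebook.

```python
# config.py (hypothetical sketch; the real module is not shown in this notebook)
import os

# Secrets are read from environment variables so they stay out of the notebook.
# The variable names and the fallback index name are assumptions for illustration.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY", "")
PINECONE_ENVIRONMENT = os.environ.get("PINECONE_ENVIRONMENT", "")
PINECONE_INDEX_NAME = os.environ.get("PINECONE_INDEX_NAME", "papers")

# Defaults described in this notebook.
EMBEDDING_MODEL = "text-embedding-ada-002"
SPLITTER_CHUNK_SIZE = 60
SPLITTER_CHUNK_OVERLAP = 15
UPLOAD_BATCH_SIZE = 32
```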

## Global Variable

- `PAPER_LIST`: File paths of the papers to upload.

In [3]:
PAPER_LIST = ["data/paper1.pdf", "data/paper2.pdf", "data/paper3.pdf"]

## Helper Functions

In [4]:
def print_match(result):
    for match in result['matches']:
        print("="*60)
        print(f"Score: {match['score']:.2f} \t Source: {match['metadata']['source']} \t Page: {int(match['metadata']['page'])}")
        print("="*60)
        print(f"{match['metadata']['text']}")
        print("="*60)
        print()

## Initialize OpenAI Embedding

The `text-embedding-ada-002` embedding model is used by default. See the [OpenAI embeddings documentation](https://platform.openai.com/docs/guides/embeddings/embedding-models) for more details.


In [5]:
embedding_model = OpenAIEmbeddings(
    openai_api_key=OPENAI_API_KEY, 
    model=EMBEDDING_MODEL
)

print("="*30)
print("OpenAI initialization: OK")
print("="*30)
print()

OpenAI initialization: OK



## Initialize Pinecone

If the index does not exist in your Pinecone project, a new one is created automatically.

- `metric='cosine'`: Cosine similarity is commonly used to measure how similar two documents are. Its advantage is that scores are normalized to the range [-1, 1]. Other options are listed [here](https://docs.pinecone.io/docs/indexes#distance-metrics).
- `dimension=1536`: The OpenAI `text-embedding-ada-002` model produces vectors with 1536 dimensions.
- The Pinecone free plan has limitations. See the [starter plan](https://docs.pinecone.io/docs/indexes#starter-plan) for more details.
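To make the [-1, 1] range of the cosine metric concrete, here is a small pure-Python sketch. The helper `cosine_similarity` is written for illustration only and is not part of the Pinecone client; Pinecone computes the score on the server side.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 2.0], [2.0, 4.0])        # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # perpendicular
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])  # opposite direction
print(same, orthogonal, opposite)
```

Vectors pointing in the same direction score close to 1, perpendicular vectors score 0, and opposite vectors score close to -1, regardless of vector length.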

In [6]:
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENVIRONMENT
)

if PINECONE_INDEX_NAME not in pinecone.list_indexes():
    # we create a new index if it doesn't exist
    pinecone.create_index(
        name=PINECONE_INDEX_NAME,
        metric='cosine',
        dimension=1536  # 1536 dim of text-embedding-ada-002
    )
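    # wait until the index reports that it is ready (a fixed 1-second sleep may be too short)
    while not pinecone.describe_index(PINECONE_INDEX_NAME).status['ready']:
        time.sleep(1)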

pinecone_index = pinecone.Index(PINECONE_INDEX_NAME)
pinecone_stats = pinecone_index.describe_index_stats()
print("="*30)
print("Pinecone initialization: OK")
print(pinecone_stats)
print("="*30)
print()

Pinecone initialization: OK
{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}



## Upload PDF Files

### Load PDFs

`PyPDF` is used to load the PDFs, and the [Tiktoken Splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token#tiktoken) is used to split each document by tokens. Tiktoken is the tokenizer created by OpenAI, so its token counts match how OpenAI models tokenize text.

- `chunk_size=60`: Each chunk combines 60 tokens so that it carries a reasonable amount of context. Of the tested values `40`, `50`, `60`, and `100`, `60` offered a reasonable balance between context size and search resolution.
- `chunk_overlap=15`: Most papers use a two-column layout, which produces more cut-off and broken text. An overlap of 1/4 of the chunk size is used to mitigate this effect.
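The interaction between chunk size and overlap can be sketched in plain Python. This is an illustration only, assuming the splitter advances by `chunk_size - chunk_overlap` tokens per step; `split_tokens` is a hypothetical helper that works on stand-in integer ids and is not the LangChain implementation.

```python
def split_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
    return chunks


tokens = list(range(100))  # stand-in for 100 token ids
chunks = split_tokens(tokens, chunk_size=60, chunk_overlap=15)
for chunk in chunks:
    # first token id, last token id, number of tokens in the chunk
    print(chunk[0], chunk[-1], len(chunk))
```

With 100 tokens, this yields three chunks starting at tokens 0, 45, and 90, so the last 15 tokens of one chunk reappear as the first 15 tokens of the next.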

### Embedding

Then, each chunk is embedded into a vector with a dimension of 1536 using the OpenAI embedding model mentioned above. To upload the vectors to Pinecone, they have to be in [this format](https://docs.pinecone.io/docs/python-client#indexupsert):

```
upsert_response = index.upsert(
   vectors=[
       {'id': "vec1", "values":[0.1, 0.2, 0.3, 0.4], "metadata": {'genre': 'drama'}},
       {'id': "vec2", "values":[0.2, 0.3, 0.4, 0.5], "metadata": {'genre': 'action'}},
   ],
   namespace='example-namespace'
)
```

- Here, `str(uuid.uuid4())` is used as the `id` instead of a stringified incrementing integer because the number of vectors is large: approximately 500 vectors per PDF.
- The original text, file name, and page number of each chunk are stored as metadata.
  - For demo purposes, only this information is stored. More information, such as the paper title, publication date, and authors, can also be stored as metadata.
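As a small illustration of the record layout described above, the sketch below builds a single record. `make_record` is a hypothetical helper, and the text and the four-value vector are placeholders; real vectors from `text-embedding-ada-002` have 1536 values.

```python
import uuid


def make_record(text, vector, source, page):
    """Build one upsert record with a random UUID as its id."""
    return {
        "id": str(uuid.uuid4()),
        "values": vector,
        "metadata": {"text": text, "source": source, "page": page},
    }


record = make_record(
    text="example chunk text",    # placeholder text
    vector=[0.1, 0.2, 0.3, 0.4],  # shortened placeholder vector
    source="data/paper1.pdf",
    page=1,
)
print(record["id"], record["metadata"])
```

Because each id is a random UUID, records from different files do not need a shared counter to stay unique.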

### Batch Upload

For efficiency, vectors are uploaded to Pinecone in batches. By default, the batch size is `32`.
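The batching itself is independent of Pinecone and can be sketched with a small generator. `iter_batches` is a hypothetical helper, and the integers stand in for upsert records.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        # Slicing past the end of the list simply returns the remaining items.
        yield items[start:start + batch_size]


vectors = list(range(70))  # stand-in for 70 upsert records
sizes = [len(batch) for batch in iter_batches(vectors, 32)]
print(sizes)
```

For 70 records and a batch size of 32, this produces two full batches and a final batch of 6.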

In [7]:
for file_path in PAPER_LIST:
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split()
    print(f"Processing [{file_path}]")
    print(f"Pages shape: {len(pages)}")

    text_splitter = TokenTextSplitter(
        chunk_size=SPLITTER_CHUNK_SIZE, 
        chunk_overlap=SPLITTER_CHUNK_OVERLAP
    )

    source = pages[0].metadata["source"]

    total_sentences = []
    page_number_list = []
    for page in pages:
        page_num = page.metadata["page"] + 1
        sentences = text_splitter.split_text(page.page_content)
        total_sentences += sentences
        page_number_list += [page_num] * len(sentences)

    # Due to OpenAI rate limits, embed all chunks of a paper in a single call
    paper_embedding = embedding_model.embed_documents(total_sentences)

    # Reformat the vectors
    to_upsert = []
    for i, sentence_vector in enumerate(paper_embedding):
        to_upsert.append({
            "id": str(uuid.uuid4()),
            "values": sentence_vector,
            "metadata": {
                            "text": total_sentences[i],
                            "source": source,
                            "page": page_number_list[i]
                        }
        })

    # Upload the vectors in batches
    batch_size = UPLOAD_BATCH_SIZE
    n = len(to_upsert)
    print(f"Total number: {n}")

    for i in range(0, n, batch_size):
        # Slicing past the end of the list is safe, so the last batch needs no special case
        batch = to_upsert[i: i+batch_size]

        pinecone_index.upsert(vectors=batch)
        print(f"Uploaded batch [{i} : {min(n, i+batch_size)}]")

Processing [data/paper1.pdf]
Pages shape: 42
Total number: 736
Uploaded batch [0 : 32]
Uploaded batch [32 : 64]
Uploaded batch [64 : 96]
Uploaded batch [96 : 128]
Uploaded batch [128 : 160]
Uploaded batch [160 : 192]
Uploaded batch [192 : 224]
Uploaded batch [224 : 256]
Uploaded batch [256 : 288]
Uploaded batch [288 : 320]
Uploaded batch [320 : 352]
Uploaded batch [352 : 384]
Uploaded batch [384 : 416]
Uploaded batch [416 : 448]
Uploaded batch [448 : 480]
Uploaded batch [480 : 512]
Uploaded batch [512 : 544]
Uploaded batch [544 : 576]
Uploaded batch [576 : 608]
Uploaded batch [608 : 640]
Uploaded batch [640 : 672]
Uploaded batch [672 : 704]
Uploaded batch [704 : 736]
Processing [data/paper2.pdf]
Pages shape: 26
Total number: 523
Uploaded batch [0 : 32]
Uploaded batch [32 : 64]
Uploaded batch [64 : 96]
Uploaded batch [96 : 128]
Uploaded batch [128 : 160]
Uploaded batch [160 : 192]
Uploaded batch [192 : 224]
Uploaded batch [224 : 256]
Uploaded batch [256 : 288]
Uploaded batch [288 : 320]

## Testing

Run a few queries against the index to check the upload results.
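The tests below repeat the same embed-then-query steps, so they could be wrapped in one helper. The sketch below is an assumption about how that might look: `search` and `FakeIndex` are hypothetical names, and the fake index only stands in for Pinecone so that the example runs without network access.

```python
def search(query, embed_fn, index, top_k=3):
    """Embed a query string and return the top_k matches with metadata."""
    vector = embed_fn(query)
    return index.query(vector=vector, top_k=top_k, include_metadata=True)


class FakeIndex:
    """Stand-in for a Pinecone index; returns canned matches."""

    def query(self, vector, top_k, include_metadata):
        match = {"score": 1.0, "metadata": {"source": "fake.pdf", "page": 1, "text": "stub"}}
        return {"matches": [match] * top_k}


# The lambda is a dummy embedding function that ignores its input.
result = search("example query", embed_fn=lambda text: [0.0, 1.0], index=FakeIndex())
print(len(result["matches"]))
```

In this notebook, a call such as `search(query, embedding_model.embed_query, pinecone_index)` would play the same role, and its result could be passed to `print_match`.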

### Test 1

In [15]:
query = "How to treat patient with ACHD?"
print("="*30)
print(f"Test 1: {query}")
print("="*30)
# embed_query returns a single flat vector for the query string
query_embedding = embedding_model.embed_query(query)
res = pinecone_index.query(vector=query_embedding, top_k=3, include_metadata=True)
print_match(res)

Test 1: How to treat patient with ACHD?
Score: 0.86 	 Source: data/paper1.pdf 	 Page: 6

effective and/or bene ﬁcial.
Practical tip. ACHD patients should be referred early
and followed by transplant and ACHD teams to determine
optimal timing for transplant listing. HTx should be
considered as a potential management strategy in ACHD

Score: 0.83 	 Source: data/paper1.pdf 	 Page: 7
 centres with multidisciplinary
expertise in congenital heart disease and trans-plantation (Strong Recommendation, Low-Quality
Evidence).
Values and preferences. Factors unique to patients
with ACHD must be considered during HTx assessment.
RECOMMENDATION
17

Score: 0.83 	 Source: data/paper1.pdf 	 Page: 7
 HLivT has similar outcomes compared with non-ACHD patients.78It should be emphasized that observationalRECOMMENDATION
16. We recommend patients with ACHD undergo eval-
uation and HTx at centres with multidisciplinary
expertise in congenital heart disease and



### Test 2

In [16]:
query = "How to reduce the cardiorenal risk?"
print("="*30)
print(f"Test 2: {query}")
print("="*30)
# embed_query returns a single flat vector for the query string
query_embedding = embedding_model.embed_query(query)
res = pinecone_index.query(vector=query_embedding, top_k=3, include_metadata=True)
print_match(res)

Test 2: How to reduce the cardiorenal risk?
Score: 0.88 	 Source: data/paper2.pdf 	 Page: 9
iorenal Risk Reduction

Score: 0.88 	 Source: data/paper2.pdf 	 Page: 7
. 1159
CCS Guideline for Cardiorenal Risk Reduction

Score: 0.88 	 Source: data/paper2.pdf 	 Page: 15
 1167
CCS Guideline for Cardiorenal Risk Reduction



### Test 3

In [17]:
query = "How to diagnose Resistant Hypertension?"
print("="*30)
print(f"Test 3: {query}")
print("="*30)
# embed_query returns a single flat vector for the query string
query_embedding = embedding_model.embed_query(query)
res = pinecone_index.query(vector=query_embedding, top_k=3, include_metadata=True)
print_match(res)

Test 3: How to diagnose Resistant Hypertension?
Score: 0.91 	 Source: data/paper3.pdf 	 Page: 4
R
C
Conﬁrm diagnosis of true resistant hypertension
Figure 1. Diagnostic algorithm for a patient with suspected resistant hypertension. ABPM, ambulatory blood pressure monitoring; BP, blood
pressure; HT, hypertension. *Three or more drugs, at optimally tolerated dosages

Score: 0.90 	 Source: data/paper3.pdf 	 Page: 4
resistant" hypertension
Refer to HT specialistConsider referral to HT specialist
R
C
Conﬁrm diagnosis of true resistant hypertension
Figure 1. Diagnostic algorithm for a patient with suspected resistant hypertension. ABPM, ambulatory blood pressure monitoring; BP, blood

Score: 0.88 	 Source: data/paper3.pdf 	 Page: 3
 for uncontrolled vs controlled RHT.20
Diagnosis of Resistant Hypertension
The diagnosis of RHT should take into account proper
ofﬁce and out-of-of ﬁce BP measurement, optimization of
pharmacotherapy taking into consideration clinical inertia,

