# Upload Papers to Pinecone Vector Storage

## Install Packages

If you are using macOS, use `pip3` instead of `pip`.

`-qU` combines `-q` (quiet) and `-U` (upgrade).

In [1]:
!pip install -qU \
    langchain==0.0.276 \
    openai==0.27.10 \
    tiktoken==0.4.0 \
    pinecone-client==2.2.2 \
    wikipedia==1.4.0 \
    pypdf==3.15.4

## Import Packages

In [None]:
from langchain.embeddings import OpenAIEmbeddings   
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import TokenTextSplitter

import pinecone
import time
import uuid

from config import (
    OPENAI_API_KEY, PINECONE_API_KEY, PINECONE_ENVIRONMENT, PINECONE_INDEX_NAME,
    EMBEDDING_MODEL, SPLITTER_CHUNK_SIZE, SPLITTER_CHUNK_OVERLAP, UPLOAD_BATCH_SIZE,
)
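
The settings are imported from a local `config` module that is not shown in this notebook. A minimal sketch of what such a `config.py` might look like, assuming the secrets are supplied through environment variables. The environment variable names and the fallback index name `papers` are hypothetical; the embedding model, chunk size, chunk overlap, and batch size are the defaults described in this notebook.

```python
# config.py (hypothetical sketch; adapt names and values to your own setup)
import os

# Secrets are read from environment variables so they stay out of version control.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
PINECONE_API_KEY = os.environ.get("PINECONE_API_KEY", "")
PINECONE_ENVIRONMENT = os.environ.get("PINECONE_ENVIRONMENT", "")
PINECONE_INDEX_NAME = os.environ.get("PINECONE_INDEX_NAME", "papers")  # placeholder name

# Defaults described in this notebook.
EMBEDDING_MODEL = "text-embedding-ada-002"
SPLITTER_CHUNK_SIZE = 100     # tokens per chunk
SPLITTER_CHUNK_OVERLAP = 25   # tokens shared by neighbouring chunks
UPLOAD_BATCH_SIZE = 32        # vectors per upsert call
```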

## Global Variable

- `PAPER_LIST`: File paths of the papers to upload

In [3]:
PAPER_LIST = ["data/paper1.pdf", "data/paper2.pdf", "data/paper3.pdf"]

## Helper Functions

In [4]:
def print_match(result):
    for match in result['matches']:
        print("="*60)
        print(f"Score: {match['score']:.2f} \t Source: {match['metadata']['source']} \t Page: {int(match['metadata']['page'])}")
        print("="*60)
        print(f"{match['metadata']['text']}")
        print("="*60)
        print()

## Initialize OpenAI Embedding

The `text-embedding-ada-002` embedding model is used by default. Refer to the [OpenAI embeddings documentation](https://platform.openai.com/docs/guides/embeddings/embedding-models) for more details.


In [5]:
embedding_model = OpenAIEmbeddings(
    openai_api_key=OPENAI_API_KEY, 
    model=EMBEDDING_MODEL
)

print("="*30)
print("OpenAI initialization: OK")
print("="*30)
print()

OpenAI initialization: OK



## Initialize Pinecone

If the index does not already exist in your Pinecone project, the cell below creates it.

- `metric='cosine'`: Cosine similarity is commonly used to measure how similar two documents are. Its scores are bounded to the [-1, 1] range. Other options are listed [here](https://docs.pinecone.io/docs/indexes#distance-metrics).
- `dimension=1536`: The OpenAI `text-embedding-ada-002` embedding has a dimension of 1536.
- The Pinecone free plan has limitations. Refer to the [starter plan](https://docs.pinecone.io/docs/indexes#starter-plan) for more details.
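
To make the `cosine` metric concrete, here is a small pure-Python sketch of cosine similarity. It only illustrates the formula and is not Pinecone's implementation; the function name `cosine_similarity` and the sample vectors are made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])         # same direction, close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # perpendicular, exactly 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])   # opposite direction, close to -1
print(same, orthogonal, opposite)
```

Because both vector lengths are divided out, only the direction of the vectors affects the score, which is why it stays within [-1, 1].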

In [6]:
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENVIRONMENT
)

if PINECONE_INDEX_NAME not in pinecone.list_indexes():
    # we create a new index if it doesn't exist
    pinecone.create_index(
        name=PINECONE_INDEX_NAME,
        metric='cosine',
        dimension=1536  # 1536 dim of text-embedding-ada-002
    )
    # wait for index to be initialized
    time.sleep(1)

pinecone_index = pinecone.Index(PINECONE_INDEX_NAME)
pinecone_stats = pinecone_index.describe_index_stats()
print("="*30)
print("Pinecone initialization: OK")
print(pinecone_stats)
print("="*30)
print()

Pinecone initialization: OK
{'dimension': 1536,
 'index_fullness': 0.01366,
 'namespaces': {'': {'vector_count': 1366}},
 'total_vector_count': 1366}



## Upload PDF Files

### Load PDFs

`PyPDF` loads the PDFs, and the [Tiktoken splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/split_by_token#tiktoken) splits each document by tokens. Tiktoken is OpenAI's own tokenizer, so chunk sizes measured with it match how OpenAI models count tokens.

- `chunk_size=100`: Each chunk holds 100 tokens, which gives the model more context for each source it retrieves from the vector store.
- `chunk_overlap=25`: Most papers use a two-column layout, which produces more cut-off and broken text. An overlap of 1/4 of the chunk size mitigates this effect.
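
The interaction between chunk size and overlap can be sketched with a simplified sliding window. This is an illustration only, not LangChain's implementation: the function name `split_with_overlap` is hypothetical, and plain integers stand in for the tiktoken token ids that the real splitter decodes back into text.

```python
def split_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of `chunk_size` tokens, advancing by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [tokens[start:start + chunk_size] for start in range(0, len(tokens), step)]

tokens = list(range(10))  # stand-in for token ids
chunks = split_with_overlap(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so text cut at a chunk boundary still appears intact in a neighbouring chunk. With this notebook's settings (100 and 25), each window starts 75 tokens after the previous one.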

### Embedding

Each chunk is then embedded into a vector with a dimension of 1536 using the OpenAI embedding model mentioned above. To upload the vectors to Pinecone, they have to be in [this format](https://docs.pinecone.io/docs/python-client#indexupsert):

```
upsert_response = index.upsert(
   vectors=[
       {'id': "vec1", "values":[0.1, 0.2, 0.3, 0.4], "metadata": {'genre': 'drama'}},
       {'id': "vec2", "values":[0.2, 0.3, 0.4, 0.5], "metadata": {'genre': 'action'}},
   ],
   namespace='example-namespace'
)
```

- `str(uuid.uuid4())` is used as the `id` instead of a stringified incrementing integer because the number of vectors is large: approximately 300 vectors per PDF.
- The original text, file name, and page number of each chunk are stored as metadata.
  - For demo purposes, only this information is stored. More information, such as paper title, publication date, and authors, can be stored as metadata.

### Batch Upload

For efficiency, vectors are uploaded to Pinecone in batches. By default, the batch size is `32`.
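
The batching step can be sketched in isolation as a small generator. The helper name `iter_batches` and the placeholder records are assumptions for illustration; in the notebook, each batch would be passed to `pinecone_index.upsert(vectors=batch)`.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # Slicing past the end of a list is safe, so the last batch is simply shorter.
        yield items[start:start + batch_size]

records = [f"vec{i}" for i in range(70)]  # placeholder records
sizes = [len(batch) for batch in iter_batches(records, batch_size=32)]
print(sizes)  # [32, 32, 6]
```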

In [7]:
for file_path in PAPER_LIST:
    loader = PyPDFLoader(file_path)
    pages = loader.load_and_split()
    print(f"Processing [{file_path}]")
    print(f"Pages shape: {len(pages)}")

    text_splitter = TokenTextSplitter(
        chunk_size=SPLITTER_CHUNK_SIZE, 
        chunk_overlap=SPLITTER_CHUNK_OVERLAP
    )

    source = pages[0].metadata["source"]

    total_sentences = []
    page_number_list = []
    for page in pages:
        page_num = page.metadata["page"] + 1
        sentences = text_splitter.split_text(page.page_content)
        total_sentences += sentences
        page_number_list += [page_num] * len(sentences)

    # Because of OpenAI rate limits, embed all chunks of the paper in a single call
    paper_embedding = embedding_model.embed_documents(total_sentences)

    # Reformat the vectors
    to_upsert = []
    for i, sentence_vector in enumerate(paper_embedding):
        to_upsert.append({
            "id": str(uuid.uuid4()),
            "values": sentence_vector,
            "metadata": {
                            "text": total_sentences[i],
                            "source": source,
                            "page": page_number_list[i]
                        }
        })

    # Upload the vectors in batches
    batch_size = UPLOAD_BATCH_SIZE
    n = len(to_upsert)
    print(f"Total number: {n}")

    for i in range(0, n, batch_size):
        # Slicing past the end of a list is safe, so no bounds check is needed
        batch = to_upsert[i: i+batch_size]

        pinecone_index.upsert(vectors=batch)
        print(f"Uploaded batch [{i} : {min(n, i+batch_size)}]")

## Testing

Check the upload results by running a few sample queries against the index.
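
Each test cell repeats the same steps: print a header, embed the query, and search the index. A possible way to factor this out is sketched below. The helper `run_query_test` and its injected `embed` and `search` callables are hypothetical names, and the stand-in functions exist only so the sketch runs without network access.

```python
def run_query_test(label, query, embed, search, top_k=3):
    """Print a header, embed `query`, search the index, and return the raw result."""
    print("=" * 30)
    print(f"{label}: {query}")
    print("=" * 30)
    vector = embed(query)
    return search(vector, top_k)

# Offline stand-ins, for illustration only.
def fake_embed(text):
    return [float(len(text))]

def fake_search(vector, top_k):
    return {"matches": [], "vector": vector, "top_k": top_k}

result = run_query_test("Demo", "hello", fake_embed, fake_search)
```

In this notebook, `embed` could be `lambda q: embedding_model.embed_documents([q])` and `search` could be `lambda v, k: pinecone_index.query(v, top_k=k, include_metadata=True)`, with the returned result passed to `print_match`.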

### Test 1

In [8]:
query = "Why the dot products get large in transformer?"
print("="*30)
print(f"Test 1: {query}")
print("="*30)
query_embedding = embedding_model.embed_documents([query])
res = pinecone_index.query(query_embedding, top_k=3, include_metadata=True)
print_match(res)

Test 1: Why the dot products get large in transformer?
Score: 0.80 	 Source: data/test.pdf 	 Page: 4
 values of
dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has
extremely small gradients4. To counteract this effect, we scale the dot products by1√dk.
3.2.2 Multi-Head Attention
Instead of performing a single attention function with dmodel-dimensional keys, values and queries,
we found it beneficial to linearly project the queries, keys and values htimes with different, learned
linear projections to

Score: 0.79 	 Source: data/test.pdf 	 Page: 4

we found it beneficial to linearly project the queries, keys and values htimes with different, learned
linear projections to dk,dkanddvdimensions, respectively. On each of these projected versions of
queries, keys and values we then perform the attention function in parallel, yielding dv-dimensional
4To illustrate why the dot products get large, assume that the components of qandkare independent

### Test 2

In [9]:
query = "How to treat patient with ACHD?"
print("="*30)
print(f"Test 2: {query}")
print("="*30)
query_embedding = embedding_model.embed_documents([query])
res = pinecone_index.query(query_embedding, top_k=3, include_metadata=True)
print_match(res)

Test 2: How to treat patient with ACHD?
Score: 0.86 	 Source: data/paper1.pdf 	 Page: 6
oided in patients with AR.RECOMMENDATION
15. We recommend early referral for assessment of HTx
in patients with ACHD with progressive cardiacsymptoms despite optimal medical, surgical, and
interventional therapies (Strong Recommendation,
Very Low-Quality Evidence).340 Canadian Journal of Cardiology
Volume 36 2020

Score: 0.85 	 Source: data/paper1.pdf 	 Page: 6
. ACHD patients should be referred early
and followed by transplant and ACHD teams to determine
optimal timing for transplant listing. HTx should be
considered as a potential management strategy in ACHDpatients even when some surgical options might be
available.
Anatomic substrate, including systemic right vs left
ventricle and single vs biventricular heart has been associatedwith pretransplantation outcomes.
59,65Brain natriuretic pep

Score: 0.85 	 Source: data/paper1.pdf 	 Page: 7
 trans-plantation (Strong Recommendation, Low-Quality
Evide

### Test 3

In [10]:
query = "How to diagnose Resistant Hypertension?"
print("="*30)
print(f"Test 3: {query}")
print("="*30)
query_embedding = embedding_model.embed_documents([query])
res = pinecone_index.query(query_embedding, top_k=3, include_metadata=True)
print_match(res)

Test 3: How to diagnose Resistant Hypertension?
Score: 0.91 	 Source: data/paper3.pdf 	 Page: 4
R
C
Conﬁrm diagnosis of true resistant hypertension
Figure 1. Diagnostic algorithm for a patient with suspected resistant hypertension. ABPM, ambulatory blood pressure monitoring; BP, blood
pressure; HT, hypertension. *Three or more drugs, at optimally tolerated dosages, and preferably including a diuretic.yHome BP monitoring can be
performed if ABPM is not accessible.628 Canadian Journal of Cardiology
Volume 36 2020

Score: 0.88 	 Source: data/paper3.pdf 	 Page: 3
Hypertension Canada ’s 2020 Guidelines on
Resistant Hypertension
Epidemiology of RHT
RHT is de ﬁned as having a BP above target with use of at
least 3 medications, at optimal doses, including a diuretic.True RHT is diagnosed when causes of pseudoresistance and
secondary causes are further excluded. Causes of pseudore-
sistance, also termed “apparent treatment-resistant hyperten-
s

Score: 0.87 	 Source: data/paper3.pdf 	 Page: 9
C

### Test 4

In [11]:
query = "How to reduce the cardiorenal risk?"
print("="*30)
print(f"Test 4: {query}")
print("="*30)
query_embedding = embedding_model.embed_documents([query])
res = pinecone_index.query(query_embedding, top_k=3, include_metadata=True)
print_match(res)

Test 4: How to reduce the cardiorenal risk?
Score: 0.88 	 Source: data/paper2.pdf 	 Page: 15
ca.2022.04.029 .Mancini et al. 1167
CCS Guideline for Cardiorenal Risk Reduction

Score: 0.88 	 Source: data/paper2.pdf 	 Page: 9
and hospitalization for HF (Strong recommendation,Moderate-Quality Evidence).Mancini et al. 1161
CCS Guideline for Cardiorenal Risk Reduction

Score: 0.87 	 Source: data/paper2.pdf 	 Page: 13
 1165
CCS Guideline for Cardiorenal Risk Reduction

