# RAG application using persistent and cloud-based vector stores

In this notebook, we create a RAG application with the following additional features:
- A public embedding model from Hugging Face
- A chunking strategy for analyzing large documents
- A persistent vector store from Chroma DB
- A cloud-based vector store from Pinecone

**Case Study:**
- Search and retrieve information from the Novartis 2023 annual report.

In [1]:
# import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

import pprint
# A function for printing nicely
def nprint(text, indent=2):
    pp = pprint.PrettyPrinter(indent=indent)
    pp.pprint(text)

## Parameters:

In [39]:
modelID = "gpt-3.5-turbo"

# Create a new Chroma vector store
DB_action = 'create'

# Name and location of the Chroma vector store
collection_name = 'novartis-annual'
persist_directory = '../../chromadb'

# Number of documents to retrieve
retrieve_k = 4

# Embedding model to use
embed_model_id = 'sentence-transformers/all-mpnet-base-v2'

## Reading the document file:

In [8]:
from pypdf import PdfReader

filename = '../../source_data/novartis-annual-report-2023.pdf'

reader = PdfReader(filename)
documents = [p.extract_text().strip() for p in reader.pages]

# Filter the empty strings
documents = [text for text in documents if text]
print(len(documents))

283


Looking at a sample page:

In [6]:
print(documents[2])

IChair’s letter
In 2023, Novartis made another substantial step in trans -
forming from a diversified healthcare player into a focused 
innovative medicines company. With the successful spin-
off and listing of our generics and biosimilars division 
Sandoz on the SIX Swiss Exchange in October, we con -
cluded a major part of the portfolio transformation, which 
started 10 years ago and entailed the divestiture of sev -
eral non-core businesses as well as the establishment of 
new therapy and technology platforms. 
The portfolio changes are integral to our strategy, which 
aims to position Novartis in highly innovative and 
fast-growing areas of healthcare, while focusing our 
organizational and operational structure. The shift from 
taking a broad market approach to going deep into select 
medical areas to achieve category leadership is set to 
guide our strategy in the future and is designed to spur 
sales and profit growth and create sustainable share -
holder value.
We are confident

## Splitting the documents into chunks

Here we use the tokenizer of the [sentence-transformers/all-mpnet-base-v2](https://www.sbert.net/docs/sentence_transformer/pretrained_models.html) model, available from [Hugging Face](https://huggingface.co/sentence-transformers/all-mpnet-base-v2), to split the documents into smaller chunks.   
We use `SentenceTransformersTokenTextSplitter` to split each page by token count, so that the chunks fit the model's maximum input length.  
We also keep a 10% overlap between consecutive chunks so that information at chunk boundaries is not lost.
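
To illustrate the idea of fixed-size chunks with overlap, here is a minimal sketch in plain Python. It is not the LangChain implementation: the helper `split_with_overlap` and the toy token list are hypothetical, and the tokens are simple placeholders rather than output of the model tokenizer.

```python
def split_with_overlap(tokens, tokens_per_chunk=384, overlap_ratio=0.1):
    """Split a token list into windows that share `overlap` tokens with their neighbor."""
    overlap = int(tokens_per_chunk * overlap_ratio)
    step = tokens_per_chunk - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(tokens):
            break  # the last window already reached the end of the text
        start += step
    return chunks

# Toy input: 1000 placeholder tokens (an assumption, not real tokenizer output)
toy_tokens = [f"t{i}" for i in range(1000)]
toy_chunks = split_with_overlap(toy_tokens)
print([len(c) for c in toy_chunks])
```

With a window of 384 tokens and a 10% overlap, each window starts 346 tokens after the previous one, so the last 38 tokens of one chunk are repeated at the start of the next.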

In [10]:
from langchain.docstore.document import Document 
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
import numpy as np

# maximum input length for all-mpnet-base-v2
tokens_per_chunk = 384
chunk_overlap = int(tokens_per_chunk * 0.1)
model_name = 'sentence-transformers/all-mpnet-base-v2'

token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=chunk_overlap, 
                tokens_per_chunk=tokens_per_chunk,model_name=model_name)

token_split_texts = []
for text in documents:
    token_split_texts += token_splitter.split_text(text)

docs = [Document(page_content=chunk, metadata={"source": f"chunk-{i+1}"}) for i, chunk in enumerate(token_split_texts)]



Looking at chunk results:

In [17]:
print(f'Number of documents: {len(documents)}')
print(f'Number of chunks: {len(docs)}')
token = token_splitter.count_tokens(text = docs[5].page_content)
print(f'Number of tokens in sample chunk: {token}')
print('\nA sample chunk:\n')
nprint(docs[4].page_content)

Number of documents: 283
Number of chunks: 787
Number of tokens in sample chunk: 388

A sample chunk:

('ii ceo ’ s letter 2023 was a historic year for novartis. with the sandoz '
 'spin - off largely completing the multiyear transformation of our company, '
 'we are now completely dedicated to bringing innovative medicines to the '
 'world. as we enter this new era, our very strong financial and research and '
 'development ( r & d ) performance in 2023 underscores the benefits of our '
 'focused strategy and the progress we are making in creating value for '
 'sharehold - ers and society. we continued to show leadership in oncology, '
 'with strong growth for kisqali and pluvicto and important data read - outs '
 'that show the potential to bring these medicines to broader patient '
 'populations in early breast cancer and in earlier lines of treatment for '
 'advanced prostate cancer, respectively. other standout performers include '
 'entresto, our treat - ment for heart failure an

## Creating a local vector store
We use [Chroma](https://docs.trychroma.com/getting-started) to create a persistent vector store.   

In [19]:
import chromadb
import chromadb.utils.embedding_functions as embedding_functions

# DB_action controls what happens to the collection:
#   'create': create a new collection (delete the existing one first if you want to recreate it)
#   'load':   load an already created collection
#   'del':    delete the collection
# After a successful 'create', DB_action is switched to 'load'.

embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-mpnet-base-v2")

chroma_client = chromadb.PersistentClient(path=persist_directory)
if DB_action == 'create':
    collection = chroma_client.create_collection(name=collection_name, embedding_function=embedding_function)
    print(f'Collection {collection_name} is created, number of items: {collection.count()}')
    DB_action = 'load'
elif DB_action == 'load':
    collection = chroma_client.get_collection(name=collection_name, embedding_function=embedding_function)
    print(f'Collection {collection_name} is loaded, number of items: {collection.count()}')
elif DB_action == 'del':
    chroma_client.delete_collection(collection_name)
    collection = None
    embedding_function = None
    print(f'Collection {collection_name} is deleted')



Collection novartis-annual is loaded, number of items: 787


## Local embeddings
We use the same embedding model that we used for splitting to convert the chunk texts into vectors.

In [31]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

embedding_function = HuggingFaceEmbeddings(model_name=embed_model_id)



In [None]:
# Embed the chunks and store them in the collection
vectordb = Chroma.from_documents(documents=docs, embedding=embedding_function,
          persist_directory=persist_directory, collection_name=collection_name)
vectordb.persist()
# Release the reference; the store is reloaded from disk in the next section
vectordb = None

## Retrieval from the persistent vector store

Defining the retriever:

In [40]:
vectorstore = Chroma(persist_directory=persist_directory, collection_name=collection_name, embedding_function=embedding_function)
print(f'Collection {collection_name} is loaded, number of items: {collection.count()}')
retriever_rag = vectorstore.as_retriever(search_kwargs={"k": retrieve_k})


Collection novartis-annual is loaded, number of items: 787


Manual retrieval:

In [36]:
# question = "Can you summarize Novartis' strategic direction and key initiatives for the future?"
question = "What are the significant milestones in Novartis' research and development pipeline mentioned in the 2023 report?"
retrieved_docs = retriever_rag.invoke(question)
print(len(retrieved_docs))
nprint(retrieved_docs[0].page_content)

4
('item 5. operating and financial review and prospects 875. c research and '
 'development, patents and licenses our research and development spending from '
 'continu - ing operations totaled usd 11. 4 billion, usd 9. 2 billion and usd '
 '8. 6 billion ( non - ifrs measure core research and development from '
 'continuing operations usd 8. 6 billion, usd 8. 3 billion and usd 8. 2 '
 'billion ) for the years 2023, 2022 and 2021, respectively. novartis has '
 'numerous products in various stages of development. for further information '
 'on these policies and these products in development, see “ item 4. infor - '
 'mation on the company — item 4. b business overview. ” as described in the '
 'risk factors section and else - where in this annual report, our drug '
 'development efforts are subject to the risks and uncertainties inher - ent '
 'in any new drug development program. due to the risks and uncertainties '
 'involved in progressing through preclinical development and clinica
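
Conceptually, the retriever embeds the question and returns the `k` chunks whose vectors are closest to it. The sketch below shows this ranking step with hand-made 3-dimensional vectors. It assumes cosine similarity as the score (the metric actually used by the vector store may differ), and the helper names and toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query, best first."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy "chunk embeddings" and a toy "question embedding" (assumed values)
toy_vectors = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.05, 0.0]
toy_hits = top_k(toy_query, toy_vectors, k=2)
print(toy_hits)
```

A real vector store avoids this exhaustive scan by using an approximate nearest-neighbor index, but the returned ranking serves the same purpose.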

Question answering from the persistent vector store:

In [41]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

llm = ChatOpenAI(temperature = 0.0, model=modelID)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever_rag,
)

In [59]:
completion_chroma = qa.run(question)
nprint(completion_chroma)

('In the 2023 report, Novartis mentioned the significant milestone of '
 'acquiring Chinook Therapeutics, which added two promising Phase III assets '
 'for IgA nephropathy to their pipeline. Additionally, they highlighted the '
 'approval of IPTA-copan to treat a rare blood disorder, which was the first '
 'of potentially many approved indications for this molecule discovered and '
 'developed by Novartis.')
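
The `"stuff"` chain type places all retrieved chunks into a single prompt that is sent to the LLM. The sketch below illustrates the idea only. The function `build_stuff_prompt` and the prompt wording are hypothetical and are not the template LangChain uses.

```python
def build_stuff_prompt(question_text, chunk_texts):
    """Concatenate all retrieved chunks into one context block followed by the question."""
    context = "\n\n".join(chunk_texts)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question_text}\nAnswer:"
    )

# Toy inputs (placeholders, not text from the report)
toy_prompt = build_stuff_prompt("What is X?", ["chunk one", "chunk two"])
print(toy_prompt)
```

Because every chunk ends up in one prompt, `retrieve_k` times the chunk size has to stay within the context window of the chat model.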


## Creating a vector store in the cloud using Pinecone

Now we use [Pinecone](https://docs.pinecone.io/guides/get-started/quickstart) to create a vector store in the cloud.
You need to create an account and get an API key from Pinecone.   
Store the API key in the [.env](../../.env) file as `PINECONE_API_KEY=YOUR_API_KEY`.   
The free tier of Pinecone can be used for up to 2 GB.
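
Since the key is read from the environment, it can help to fail early with a clear message when it is missing. The helper below is a hypothetical sketch, not part of the Pinecone client. It only assumes that `load_dotenv` has already copied the `.env` entries into the process environment.

```python
import os

def require_env(name, env=None):
    """Return the value of an environment variable or raise a descriptive error."""
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value

# Example usage (commented out so the cell does not fail without a key):
# api_key = require_env("PINECONE_API_KEY")
```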

In [46]:
import pinecone
from pinecone import Pinecone, ServerlessSpec

import time
import os

pc = Pinecone()
print(pc.list_indexes())

novartis-annual
{'indexes': [{'dimension': 256,
              'host': 'langchain-kontakts-openailarge-8aipd99.svc.aped-4627-b74a.pinecone.io',
              'metric': 'dotproduct',
              'name': 'langchain-kontakts-openailarge',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'}}]}


A sample embedding to check the required size of the vector store:

In [51]:
embedding_function = HuggingFaceEmbeddings(model_name=embed_model_id)
sample_embedding = np.array(embedding_function.embed_query(docs[0].page_content))
print("Size of the embedding: ", sample_embedding.shape)
print("Sample entries from the embedding: ", sample_embedding[0:4])

Size of the embedding:  (768,)
Sample entries from the embedding:  [-0.00823599  0.11245028 -0.01427159 -0.02212434]


Create an index in Pinecone:

In [52]:
index_name = collection_name
index_name = index_name.lower()    
    
print(index_name)

try:
    index = pc.Index(index_name)
    print('index already created')
    print(index.describe_index_stats())
except Exception:
    # The index could not be opened, so we create it
    spec = ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )

    pc.create_index(
        name=index_name,
        # the index dimension must match the embedding size found above
        dimension=sample_embedding.shape[0],
        metric="dotproduct",
        spec=spec
    )

    # wait for index to finish initialization
    while not pc.describe_index(index_name).status["ready"]:
        time.sleep(1)
    index_created = True
    index = pc.Index(index_name)



novartis-annual
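
The index above uses the `dotproduct` metric. For unit-length vectors, the dot product equals the cosine similarity, so both metrics give the same ranking. The sketch below checks this on toy vectors. It assumes the embeddings are normalized to unit length, which should be verified for the embedding model in use. All names in it are hypothetical.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (assumed values), normalized to unit length
unit_a = normalize([3.0, 4.0])
unit_b = normalize([1.0, 2.0])
dot_score = dot(unit_a, unit_b)
cos_score = cosine_similarity(unit_a, unit_b)
print(dot_score, cos_score)
```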


Storing the embeddings in Pinecone:

In [54]:
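# Note: `%time` on a line by itself times an empty statement (hence the 0 ns below);
# use `%%time` as the first line of the cell to time the whole cell.
%time

# Note: this import shadows the `Pinecone` client class imported from `pinecone` above.
# `pc` was created before the shadowing, so it keeps working.
from langchain.vectorstores import Pinecone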

if index.describe_index_stats().total_vector_count == 0:
    docsearch = Pinecone.from_documents(docs, embedding_function, index_name=index_name)
    print(f'vectors created')
else:
    print(f'vectors already stored')

print(index.describe_index_stats())

CPU times: total: 0 ns
Wall time: 0 ns
vectors already stored
{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 787}},
 'total_vector_count': 787}


Connecting the vector store to the LangChain retriever:

In [55]:
vectorstore = Pinecone(index, embedding_function, "text")
retriever_pinecone = vectorstore.as_retriever(search_kwargs={"k": retrieve_k})
qa_pinecone = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever_pinecone,
)

In [60]:
question = "What are the significant milestones in Novartis' research and development pipeline mentioned in the 2023 report?"
completion_pinecone = qa_pinecone.run(question)
nprint(completion_pinecone)

('In the 2023 report, Novartis mentioned the significant milestone of '
 'acquiring Chinook Therapeutics, which added two promising Phase III assets '
 'for IgA nephropathy to their pipeline. Additionally, they highlighted the '
 'approval of IPTA-copan to treat a rare blood disorder, which was the first '
 'of potentially many approved indications for this molecule discovered and '
 'developed by Novartis.')


Let's compare it with the completion we obtained from the RAG pipeline based on the Chroma vector store:

In [61]:
nprint(completion_chroma)

('In the 2023 report, Novartis mentioned the significant milestone of '
 'acquiring Chinook Therapeutics, which added two promising Phase III assets '
 'for IgA nephropathy to their pipeline. Additionally, they highlighted the '
 'approval of IPTA-copan to treat a rare blood disorder, which was the first '
 'of potentially many approved indications for this molecule discovered and '
 'developed by Novartis.')
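
Comparing by eye works for short answers. For longer ones, a small helper can make the comparison explicit. The sketch below uses `difflib` from the standard library. The function name and the placeholder strings are hypothetical, and in the notebook you would pass `completion_chroma` and `completion_pinecone` instead.

```python
from difflib import SequenceMatcher

def answer_similarity(answer_a, answer_b):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, answer_a, answer_b).ratio()

# Placeholder strings standing in for the two completions
toy_answer_a = "Novartis acquired Chinook Therapeutics."
toy_answer_b = "Novartis acquired Chinook Therapeutics."
toy_ratio = answer_similarity(toy_answer_a, toy_answer_b)
print(toy_ratio)
```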
