This notebook is adapted from Cohere's notebook on using a [RAG enabled Chatbot](https://github.com/cohere-ai/notebooks/blob/main/notebooks/llmu/RAG_with_Chat_Embed_and_Rerank.ipynb?ref=cohere-ai.ghost.io) for the WiDS GenAI workshop.

We will use Cohere's streaming chatbot, then add documents and receive citations showing which documents were used to answer the query.

In [1]:
import cohere
import config
import os
os.chdir("..")

co = cohere.Client(config.COHERE_KEY) 


## Quick Example

In [3]:
documents = [
    {
        "title": "Tall penguins",
        "text": "Emperor penguins are the tallest."},
    {
        "title": "Penguin habitats",
        "text": "Emperor penguins only live in Antarctica."},
    {
        "title": "What are animals?",
        "text": "Animals are different from plants."}
]

In [4]:
# Get the user message
message = "What are the tallest living penguins?"


# Generate the response
response = co.chat_stream(message=message,
                          model="command-r-plus",
                          documents=documents)

# Display the response
citations = []
cited_documents = []

for event in response:
    if event.event_type == "text-generation":
        print(event.text, end="")
    elif event.event_type == "citation-generation":
        citations.extend(event.citations)
    elif event.event_type == "stream-end":
        cited_documents = event.response.documents

# Display the citations and source documents
if citations:
  print("\n\nCITATIONS:")
  for citation in citations:
    print(citation)

  print("\nDOCUMENTS:")
  for document in cited_documents:
    print(document)

The tallest living penguins are Emperor penguins. They are native to Antarctica.

CITATIONS:
start=32 end=49 text='Emperor penguins.' document_ids=['doc_0']
start=59 end=80 text='native to Antarctica.' document_ids=['doc_1']

DOCUMENTS:
{'id': 'doc_0', 'text': 'Emperor penguins are the tallest.', 'title': 'Tall penguins'}
{'id': 'doc_1', 'text': 'Emperor penguins only live in Antarctica.', 'title': 'Penguin habitats'}
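
Each citation carries `start` and `end` character offsets into the generated text, so the citations can be merged back into the answer. The snippet below is a minimal sketch of that idea. It uses plain dictionaries copied from the output above instead of the SDK's citation objects, and `annotate` is a hypothetical helper, not part of the Cohere SDK.

```python
from typing import Dict, List


def annotate(text: str, citations: List[Dict]) -> str:
    """Insert a [doc_id] marker after each cited span (hypothetical helper)."""
    # Work from the end of the text backwards so earlier offsets stay valid.
    for citation in sorted(citations, key=lambda c: c["end"], reverse=True):
        marker = "[" + ", ".join(citation["document_ids"]) + "]"
        text = text[: citation["end"]] + marker + text[citation["end"]:]
    return text


# Answer text and citation offsets copied from the output above.
answer = "The tallest living penguins are Emperor penguins. They are native to Antarctica."
example_citations = [
    {"start": 32, "end": 49, "text": "Emperor penguins.", "document_ids": ["doc_0"]},
    {"start": 59, "end": 80, "text": "native to Antarctica.", "document_ids": ["doc_1"]},
]

annotated = annotate(answer, example_citations)
print(annotated)
```

Slicing `answer[start:end]` for each citation returns exactly the cited `text`, which is a quick way to check that the offsets line up.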


## Vitamin studies example
Document chunking can be a project in its own right, with many competing strategies. To keep this demo simple we will use text stored in a CSV, from the National Institutes of Health's (NIH) public database on vitamin and supplement studies. Other use cases for text in a CSV could be question-and-answer data from past customer interactions, a personal health journal, or scraped forum data.
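
For longer documents, one of the simplest strategies is fixed-size chunks with some overlap between neighbours. The following is only an illustrative sketch: `chunk_words`, the chunk size, and the overlap are arbitrary choices made for this example and are not used elsewhere in this notebook.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 100, overlap: int = 20) -> List[str]:
    """Split text into chunks of at most chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# 250 placeholder words: w0, w1, ..., w249
sample = " ".join(f"w{i}" for i in range(250))
chunks = chunk_words(sample, chunk_size=100, overlap=20)
print(len(chunks), [len(c.split()) for c in chunks])
```

The abstracts used below are short enough to embed whole, so no chunking is applied to them.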

**Note:** The following example is used primarily because the data is public and it is a good model of how RAG can improve generation. RAG is more effective with private documents that models do not generally have access to, such as internal policy docs, protected health information, or customer service chat histories (using a private LLM).

### Prep data

Data exported from the NIH CARDS database by searching for various vitamins and supplements, restricted to human studies: https://cards.od.nih.gov/Application/Search

In [5]:
import pandas as pd
pd.set_option("max_colwidth", 300)

In [6]:
studies_df=pd.read_csv('./data/nih_vitamin_studies.csv').drop('Unnamed: 13',axis=1)

In [7]:
studies_df.head(2)

Unnamed: 0,Type,Activity Code,Project ID,Project Title,SubProject ID,Project End Date,Investigator Names,Country,Total Dollars Amount,Fiscal Year,Funding IC,Project Abstract,Public Health Relevance
0,5,K23,DK084115,Investigations into the Glutamine-Citruline-Arginine Pathway in Sepsis,,31-Dec-2015,"Kao, Christina",UNITED STATES,171885,2014,DK,"DESCRIPTION (provided by applicant): Dr. Christina Kao is an assistant professor in the Section of Pulmonary, Critical Care, and Sleep Medicine at Baylor College of Medicine (BCM). Her short-term goal is to develop the knowledge and skills to conduct metabolic research using stable isotope ...","PUBLIC HEALTH RELEVANCE: Two compounds, arginine and glutamine, help our body fight severe infections. This research will help to better define the relationship between these two compounds and will determine if supplying more of one (glutamine) will in turn lead to increases in the other (argini..."
1,5,R01,HD072120,Early Childhood Development for the Poor: Impacting at Scale,,31-May-2018,"Meghir, Konstantinos",UNITED STATES,427218,2014,HD,"DESCRIPTION (provided by applicant): Neurobiological science has established that the first three years of life lay the basis for lifelong outcomes. But during this life-cycle stage children living in poverty are often vulnerable to negative influences including malnutrition, illnesses, an ...","PUBLIC HEALTH RELEVANCE: By identifying cost-effective and scalable early-years interventions, our research has the potential to revolutionize early childhood development (ECD) policies and contribute to breaking the intergenerational cycle of poverty. If the interventions we propose to test pr..."


In [8]:
studies_df=studies_df.fillna('')

In [9]:
# Limit to key data columns and keep Project ID and Investigator Names for later citation reference
studies_df=studies_df[['Project Title','Project ID','Investigator Names','Project Abstract']]

In [76]:
# Remove duplicates- lots of similar studies with same title
studies_df=studies_df.drop_duplicates(subset=['Project Title'])

In [77]:
# Convert to list of dictionaries for Cohere API
documents=studies_df.to_dict('records')

In [78]:
# Sample
documents[15]

{'Project Title': 'Impact of Emergency Department Probiotic Treatment of Pediatric Gastroenteritis',
 'Project ID': 'HD071915',
 'Investigator Names': 'Freedman, Stephen ; Schnadower, David ',
 'Project Abstract': "     DESCRIPTION (provided by applicant): Acute gastroenteritis (AGE) is a leading cause of malnutrition and death worldwide. In the US, close to 48 million people contract AGE and 128,000 are hospitalized each year . Episodes of AGE can result in substantial morbidity to children and their families. In addition, the costs to caregivers, the health-care system, and society are significant. At present, treatment options are limited and targeted at symptom management rather than disease modification. Probiotics - live microbial cultures which, when consumed in adequate amounts, confer documented health benefits - may be an ideal solution. They are hypothesized to work via a combination of direct microbiologic and immunologic mechanisms. Probiotics have shown promise in early c

### Question without documents
First let's ask some questions without supplying any additional context.

In [100]:
# Get the user message
message = "Can vitamin D fight cancer?"

# Generate the response
response = co.chat_stream(message=message,
                          model="command-r-plus",)

In [101]:
for event in response:
    if event.event_type == "text-generation":
        print(event.text, end="")
    elif event.event_type == "citation-generation":
        citations.extend(event.citations)


While vitamin D is known for its role in bone health, there is ongoing research investigating its potential benefits in cancer prevention and treatment. Here's an overview of the current understanding of the relationship between vitamin D and cancer:

1. Cancer Prevention:
   - Some observational studies suggest that higher levels of vitamin D in the body may be associated with a lower risk of certain types of cancer, including colorectal, breast, and prostate cancer. However, it's important to note that these studies show correlation, not causation.
   - Vitamin D is believed to have anti-cancer properties because it may help regulate cell growth and differentiation, inhibit tumor angiogenesis (formation of new blood vessels that feed tumors), and promote cell death in cancer cells.

2. Cancer Treatment:
   - In vitro (laboratory) and animal studies have shown that vitamin D and its analogs can slow or prevent the growth of cancer cells and may enhance the effectiveness of certain che

### Send bulk documents directly with question
Vitamin D deficiency and supplementation are pretty well studied, so the results above aren't bad, but they do not provide sources. Let's directly send the same question along with some documents as context.

In [102]:
# Get the user message
message = "Can vitamin D fight cancer?"

# Generate the response
response = co.chat_stream(message=message,
                          model="command-r-plus",
                          documents=documents[:20] # doesn't handle more than this many
                         )

In [103]:
# Display the response
citations = []
cited_documents = []

for event in response:
    if event.event_type == "text-generation":
        print(event.text, end="")
    elif event.event_type == "citation-generation":
        citations.extend(event.citations)
    elif event.event_type == "stream-end":
        cited_documents = event.response.documents

# Display the citations and source documents
if citations:
  print("\n\nCITATIONS:")
  for citation in citations:
    print(citation)

  print("\nDOCUMENTS:")
  for document in cited_documents:
    print(document)

There is some evidence that vitamin D can help fight cancer. Vitamin D deficiency has been linked to an increased risk of prostate cancer, and treatment with vitamin D has been shown to reduce prostate cancer disease progression in multiple studies. Additionally, vitamin D has been shown to protect against bone mineral density loss and fractures, which are common side effects of androgen deprivation therapy used in elderly prostate cancer patients. Vitamin D may also play a role in reducing the risk of HIV-related cancers, as vitamin D deficiency is widespread among HIV-infected adults and children. Furthermore, vitamin D supplementation has been found to improve cardiac function, which may be beneficial in cancer treatment. However, more research is needed to fully understand the role of vitamin D in cancer prevention and treatment.

CITATIONS:
start=61 end=137 text='Vitamin D deficiency has been linked to an increased risk of prostate cancer' document_ids=['doc_11']
start=143 end=249

**Problem!** There is a limit on how much you can send as context! The limit varies by model, but it is better to first find the most relevant documents and send only those as context.
Embedding models represent document chunks/passages as vectors, which lets you run similarity searches to find the relevant documents in the (hopefully large) dataset you embed.
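
To make the idea of similarity search concrete, here is a toy sketch with hand-written 3-dimensional vectors standing in for real embeddings. The vectors, titles, and `top_k` helper are all made up for illustration. It ranks every document by its inner product with the query vector, a brute-force version of what a vector index does approximately and at scale.

```python
from typing import Dict, List, Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: Sequence[float], doc_vecs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the k document names whose vectors score highest against the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: inner_product(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors; real embeddings have hundreds or thousands of dimensions.
toy_vectors = {
    "vitamin D and colorectal cancer": [0.9, 0.8, 0.1],
    "probiotics for gastroenteritis": [0.1, 0.2, 0.9],
    "vitamin D and bone density": [0.9, 0.1, 0.2],
}
toy_query = [1.0, 1.0, 0.0]  # pretend embedding of "Can vitamin D fight cancer?"

nearest = top_k(toy_query, toy_vectors, k=2)
print(nearest)
```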


### Embed documents
Modified from Cohere's intro that uses unstructured library https://cohere.com/blog/rag-chatbot#embed-the-document-chunks

This uses [hnswlib](https://js.langchain.com/docs/integrations/vectorstores/hnswlib), an in-memory vector store. Other tools you could use to store embeddings include Weaviate and pgvector.



In [17]:
import hnswlib
from typing import List, Dict


In [118]:
class Vectorstore:
    """
    A class representing a collection of documents indexed into a vectorstore.

    Parameters:
    document_list (list): A list of document collections. Each collection is a list of dictionaries with 'Project Title', 'Project Abstract', 'Project ID', and 'Investigator Names' keys.

    Attributes:
    document_list (list): The document collections passed in.
    docs (list): A list of dictionaries representing the loaded documents, with 'title', 'text', 'project_id', and 'authors' keys.
    docs_embs (list): A list of the associated embeddings for the documents.
    docs_len (int): The number of documents in the collection.
    idx (hnswlib.Index): The index used for document retrieval.
    retrieve_top_k (int): The number of documents fetched by the initial dense retrieval.
    rerank_top_k (int): A default rerank count (set here, but retrieve() takes rerank_topk as an argument instead).

    Methods:
    load_text(): Loads the title, abstract, project ID, and investigator names from each source record.
    embed(): Embeds the document text using the Cohere API.
    index(): Indexes the documents for efficient retrieval.
    retrieve(): Retrieves and reranks documents based on the given query.
    """

    def __init__(self, document_list: List[List[Dict[str, str]]]):
        self.document_list = document_list
        self.docs = []
        self.docs_embs = []
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        self.load_text()
        self.embed()
        self.index()

    def load_text(self) -> None:
        """
        Loads the title, abstract, project ID, and investigator names from each source record.
        """
        print("Loading documents...")

        for doc_data in self.document_list:
            
            for doc in doc_data:
                self.docs.append(
                    {
                        "title": doc["Project Title"],
                        "text": doc['Project Abstract'],
                        "project_id": doc['Project ID'],
                        "authors": doc['Investigator Names'], 
                    }
                )
    
    def embed(self) -> None:
        """
        Embeds the document text using the Cohere API.
        """
        print("Embedding document chunks...")

        batch_size = 50
        self.docs_len = len(self.docs)
        for i in range(0, self.docs_len, batch_size):
            batch = self.docs[i : min(i + batch_size, self.docs_len)]
            texts = [item["text"] for item in batch]
            docs_embs_batch = co.embed(
                texts=texts, model="embed-english-v3.0", input_type="search_document"
            ).embeddings
            self.docs_embs.extend(docs_embs_batch)

    def index(self) -> None:
        """
        Indexes the document chunks for efficient retrieval.
        """
        print("Indexing document chunks...")

        self.idx = hnswlib.Index(space="ip", dim=1024)
        self.idx.init_index(max_elements=self.docs_len, ef_construction=512, M=64)
        self.idx.add_items(self.docs_embs, list(range(len(self.docs_embs))))

        print(f"Indexing complete with {self.idx.get_current_count()} document chunks.")

    def retrieve(self, query: str, rerank_topk: int) -> List[Dict[str, str]]:
        """
        Retrieves document chunks based on the given query, then reranks them.

        Parameters:
        query (str): The query to retrieve document chunks for.
        rerank_topk (int): The number of documents to keep after reranking.

        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved document chunks, with 'title', 'text', and 'project_id' keys.
        """
    
        # Dense retrieval
        query_emb = co.embed(
            texts=[query], model="embed-english-v3.0", input_type="search_query"
        ).embeddings

        doc_ids = self.idx.knn_query(query_emb, k=self.retrieve_top_k)[0][0]

        # Reranking
        rank_fields = ["title", "text"] # We'll use the title and text fields for reranking

        docs_to_rerank = [self.docs[doc_id] for doc_id in doc_ids]

        rerank_results = co.rerank(
            query=query,
            documents=docs_to_rerank,
            top_n=rerank_topk,
            model="rerank-english-v3.0",
            rank_fields=rank_fields
        )

        doc_ids_reranked = [doc_ids[result.index] for result in rerank_results.results]

        docs_retrieved = []
        for doc_id in doc_ids_reranked:
            docs_retrieved.append(
                {
                    "title": self.docs[doc_id]["title"],
                    "text": self.docs[doc_id]["text"],
                    "project_id": self.docs[doc_id]["project_id"],
                }
            )

        return docs_retrieved

In [119]:
# Create a vectorstore and embed ALL the documents
vectorstore = Vectorstore([documents])

Loading documents...
Embedding document chunks...
Indexing document chunks...
Indexing complete with 283 document chunks.
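
The `embed()` method sends texts to the API in batches of 50 instead of all at once. As a rough sketch of that slicing pattern on its own (no API calls are made, and `batched` is a hypothetical helper), the 283 documents indexed above split into five full batches and one partial batch:

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[List]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        # Slicing past the end of a sequence is safe, so no min() is needed here.
        yield list(items[start:start + batch_size])


# Stand-in for the 283 documents: just the integers 0..282.
batch_sizes = [len(batch) for batch in batched(range(283), 50)]
print(batch_sizes)
```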


### Embed/rerank step by step
- embed docs with HNSWlib vectorstore
- embed query, search (uses knn) and retrieve top 10 documents 
- use Cohere's rerank to weight which docs are most relevant to the query

In [120]:
vectorstore.retrieve_top_k

10

In [121]:
message = "Can vitamin D fight cancer?"


In [122]:
# Embed query 
query_emb = co.embed(
        texts=[message], model="embed-english-v3.0", input_type="search_query"
    ).embeddings
# Perform a knn search and retrieve the top 10 docs
doc_ids = vectorstore.idx.knn_query(query_emb, k=vectorstore.retrieve_top_k)[0][0]
docs_to_rerank=[vectorstore.docs[doc_id] for doc_id in doc_ids]


In [123]:
# Peek at the titles retrieved
for doc in docs_to_rerank:
    print(doc['title'])

Novel randomized controlled trials of vitamin D supplementation in patients with colorectal cancer: Impact on survival and biology
Leveraging Novel Randomized Clinical Trials of Vitamin D Supplementation in Patients with Colorectal Cancer: Impact on Survival and Anti-Tumor Immunity
Project 2: Colorectal Cancer
Biological and Environmental modifiers of Vitamin D3 and Prostate Cancer Risk
High-dose Vitamin D Supplementation for ADT-induced Side Effects
Vitamin D and Follicular Lymphoma
Vitamin D3, Calcium and Inflammation, Immunomodulation and Colonic Permeability
Project 1:  The Role of Vitamin D in Protecting Against Cachexia in Cancer Patients
Vitamin D3, Calcium and Biomarkers of Gut Barrier Function
The Effects of Vitamin D on Mammographic Density and Breast Tissue


### Reranking
Reranking takes the retrieved document chunks and reorders them by relevance, based on context and domain. Whereas retrieval may focus on matches specific to the words in the query, a reranker tries to focus on the general subject of the query and ranks the retrieval results by relevance to that subject.
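
One detail that is easy to miss: the reranker only sees the candidates it is given, so each `result.index` is a position in the candidate list, not an ID in the vector store. That is why the code maps results back with `doc_ids[result.index]`. The sketch below mimics that mapping with made-up IDs and scores. `rerank_ids` and `toy_scores` are hypothetical stand-ins, not the Cohere API.

```python
from typing import Callable, List


def rerank_ids(candidate_ids: List[int], score: Callable[[int], float], top_n: int) -> List[int]:
    """Order candidate positions by score, then map them back to store IDs."""
    positions = sorted(
        range(len(candidate_ids)),
        key=lambda pos: score(candidate_ids[pos]),
        reverse=True,
    )
    return [candidate_ids[pos] for pos in positions[:top_n]]


# Hypothetical store IDs returned by the first-stage search, in similarity order.
candidate_ids = [42, 7, 19, 3]
# Hypothetical relevance scores a reranker might assign to each store ID.
toy_scores = {42: 0.31, 7: 0.88, 19: 0.12, 3: 0.57}

reranked_ids = rerank_ids(candidate_ids, toy_scores.get, top_n=3)
print(reranked_ids)
```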

In [125]:
# Reranking with default top_k=3
rank_fields = ["title", "text"] # We'll use the title and text fields for reranking

rerank_results = co.rerank(
    query=message,
    documents=docs_to_rerank,
    top_n=3,
    model="rerank-english-v3.0",
    rank_fields=rank_fields
)

doc_ids_reranked = [doc_ids[result.index] for result in rerank_results.results]
docs_reranked=[vectorstore.docs[doc_id] for doc_id in doc_ids_reranked]

In [128]:
for doc in docs_reranked:
    print(doc['title'])

Leveraging Novel Randomized Clinical Trials of Vitamin D Supplementation in Patients with Colorectal Cancer: Impact on Survival and Anti-Tumor Immunity
Novel randomized controlled trials of vitamin D supplementation in patients with colorectal cancer: Impact on survival and biology
The Effects of Vitamin D on Mammographic Density and Breast Tissue


In [131]:
# Pull in more with rerank topk = 5
rank_fields = ["title", "text"] # We'll use the title and text fields for reranking

rerank_results = co.rerank(
    query=message,
    documents=docs_to_rerank,
    top_n=5,
    model="rerank-english-v3.0",
    rank_fields=rank_fields
)

doc_ids_reranked = [doc_ids[result.index] for result in rerank_results.results]
docs_reranked=[vectorstore.docs[doc_id] for doc_id in doc_ids_reranked]

In [132]:
for doc in docs_reranked:
    print(doc['title'])

Leveraging Novel Randomized Clinical Trials of Vitamin D Supplementation in Patients with Colorectal Cancer: Impact on Survival and Anti-Tumor Immunity
Novel randomized controlled trials of vitamin D supplementation in patients with colorectal cancer: Impact on survival and biology
The Effects of Vitamin D on Mammographic Density and Breast Tissue
Project 1:  The Role of Vitamin D in Protecting Against Cachexia in Cancer Patients
Vitamin D3, Calcium and Inflammation, Immunomodulation and Colonic Permeability


## Pull it together

In [127]:
message = "Can vitamin D fight cancer?"
embedded_documents= vectorstore.retrieve(message,rerank_topk=5)

In [114]:
# Generate the response
response = co.chat_stream(message=message,
                          model="command-r-plus",
                          documents= embedded_documents
                         )

In [115]:
# Display the response using default of 3 top
citations = []
cited_documents = []

for event in response:
    if event.event_type == "text-generation":
        print(event.text, end="")
    elif event.event_type == "citation-generation":
        citations.extend(event.citations)
    elif event.event_type == "stream-end":
        cited_documents = event.response.documents

# Display the citations and source documents
if citations:
  print("\n\nCITATIONS:")
  for citation in citations:
    print(citation)

  print("\nDOCUMENTS:")
  for document in cited_documents:
    print(document)

While there is evidence that vitamin D possesses anti-neoplastic activity, it is unclear whether vitamin D can fight cancer. Preclinical and epidemiologic data suggest that individuals with higher plasma 25-hydroxyvitamin D [25(OH)D] levels have a lower risk of colorectal cancer (CRC) and improved survival from CRC. However, it is not yet known whether these findings reflect a true causal relationship. Randomized clinical trials are being conducted to test the hypothesis that vitamin D supplementation leads to improved survival in CRC patients. Additionally, there is evidence that vitamin D may play a role in breast density and breast carcinogenesis, but large-scale randomized studies are needed to examine this further.

CITATIONS:
start=29 end=73 text='vitamin D possesses anti-neoplastic activity' document_ids=['doc_0', 'doc_1']
start=125 end=159 text='Preclinical and epidemiologic data' document_ids=['doc_0', 'doc_1']
start=173 end=240 text='individuals with higher plasma 25-hydroxyv

## Do more with full tutorial!
I encourage you to work through Cohere's full tutorial, which explains the steps in more detail: https://cohere.com/blog/rag-chatbot
